arXiv:1507.02592v2 [cs.LG] 1 Sep 2015 


Fast Rates in Statistical and Online Learning 


Tim van Erven* 

Mathematisch Instituut, Universiteit Leiden 
Leiden, 2300 RA, The Netherlands 

Peter D. Grünwald 

Centrum voor Wiskunde en Informatica and MI, Universiteit Leiden 
Amsterdam, NL-1090 GB, The Netherlands 

Nishant A. Mehta† 

Centrum voor Wiskunde en Informatica 
Amsterdam, NL-1090 GB, The Netherlands 

Mark D. Reid 

Australian National University and NICTA 
Canberra, ACT 2601, Australia 


Peter.Grunwald@cwi.nl 


Mark.Reid@anu.edu.au 


tim@timvanerven.nl 


mehta@cwi.nl 


Robert C. Williamson 

Australian National University and NICTA 
Canberra, ACT 2601 Australia. 


Bob.Williamson@anu.edu.au 


Vladimir N. Vapnik, Alexander J. Gammerman and Vladimir G. Vovk 


Abstract 


The speed with which a learning algorithm converges as it is presented with more data 
is a central problem in machine learning — a fast rate of convergence means less data is 
needed for the same level of performance. The pursuit of fast rates in online and statistical 
learning has led to the discovery of many conditions in learning theory under which fast 
learning is possible. We show that most of these conditions are special cases of a single, 
unifying condition, that comes in two forms: the central condition for ‘proper’ learning 
algorithms that always output a hypothesis in the given model, and stochastic mixability 
for online algorithms that may make predictions outside of the model. We show that under 
surprisingly weak assumptions both conditions are, in a certain sense, equivalent. The 
central condition has a re-interpretation in terms of convexity of a set of pseudoprobabilities, 
linking it to density estimation under misspecification. For bounded losses, we show how 
the central condition enables a direct proof of fast rates and we prove its equivalence to 
the Bernstein condition, itself a generalization of the Tsybakov margin condition, both of 
which have played a central role in obtaining fast rates in statistical learning. Yet, while the 
Bernstein condition is two-sided, the central condition is one-sided, making it more suitable 
to deal with unbounded losses. In its stochastic mixability form, our condition generalizes 
both a stochastic exp-concavity condition identified by Juditsky, Rigollet and Tsybakov and 
Vovk’s notion of mixability. Our unifying conditions thus provide a substantial step towards 
a characterization of fast rates in statistical learning, similar to how classical mixability 
characterizes constant regret in the sequential prediction with expert advice setting. 


*. Authors listed alphabetically. Preliminary versions of some parts of this work were presented at NIPS 
2012 and at NIPS 2014 (see acknowledgments on page 53). 
†. Work performed while at ANU and NICTA. 

1. Introduction 

Alexey Chervonenkis jointly achieved several significant milestones in the theory of machine 
learning: the characterization of uniform convergence of relative frequencies of events 
to their probabilities (Vapnik and Chervonenkis, 1971), the uniform convergence of means 
to their expectations (Vapnik and Chervonenkis, 1981), and the ‘key theorem in learning 
theory’ showing the relationship between the consistency of empirical risk minimization 
(ERM) and the uniform one-sided convergence of means to expectations (Vapnik and 
Chervonenkis, 1991; Vapnik, 1998, Chapters). Two outstanding features of these contributions 
are that they characterized the phenomenon in question, and that the quantitative results are 
parametrization independent in the sense that they do not depend upon how elements of 
the hypothesis class $\mathcal{F}$ are parameterized, only on global (effectively geometric) 
properties of $\mathcal{F}$. With his co-author Vladimir Vapnik, Alexey Chervonenkis also presented 
quantitative bounds on the deviation between the empirical and expected risk as a function of the 
sample size $n$. These are used for the theoretical analysis of the statistical convergence of 
ERM algorithms, which are central to machine learning. According to Vapnik (1998, p. 
695), in his 1974 book co-authored with Chervonenkis (Vapnik and Chervonenkis, 1974) they 
presented ‘slow’ and ‘fast’ bounds for ERM when used with 0-1 loss. They showed that in 
the realizable or ‘optimistic’ case (where there is an $f \in \mathcal{F}$ that almost surely predicts 
correctly, so that the minimum achievable risk is zero) one can achieve fast $O(1/n)$ convergence, 
as opposed to the ‘pessimistic’ case where one does not have such an $f$ in the hypothesis 
class and the best uniform bound is $O(1/\sqrt{n})$ (Vapnik, 1998, page 127). This difference is 
important because if one is in such a ‘fast rate’ regime, one can achieve good performance 
with less data. 
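
To make the practical gap between the two regimes concrete, the following minimal sketch computes how many samples are needed before a bound of the form $C/n^{\alpha}$ drops below a target accuracy. The function name `samples_needed` and the constant $C = 1$ are illustrative assumptions, not quantities taken from the paper.

```python
import math


def samples_needed(epsilon: float, rate_exponent: float, constant: float = 1.0) -> int:
    """Smallest n such that constant / n**rate_exponent <= epsilon.

    rate_exponent = 1.0 models a fast O(1/n) bound, rate_exponent = 0.5 a slow
    O(1/sqrt(n)) bound. The constant is a hypothetical stand-in for the
    complexity-dependent factor in a real bound.
    """
    if epsilon <= 0 or constant <= 0 or rate_exponent <= 0:
        raise ValueError("epsilon, constant and rate_exponent must be positive")
    return math.ceil((constant / epsilon) ** (1.0 / rate_exponent))


if __name__ == "__main__":
    target = 2 ** -7  # target excess risk
    print("fast rate:", samples_needed(target, 1.0))
    print("slow rate:", samples_needed(target, 0.5))
```

Under these assumptions the slow-rate sample size is the square of the fast-rate one, which is why the distinction matters for small target accuracies.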

The present paper makes several further contributions along this path first delineated by 
Vapnik and Chervonenkis. We focus upon the distinction between slow and fast learning. As 
shown in the special case of squared loss by Lee et al. (1998) and log loss by Li (1999), if the 
hypothesis class is convex, one can still attain fast $O(1/n)$ convergence even in the agnostic 
(pessimistic) setting.¹ Such convergence results, like those of Vapnik and Chervonenkis, 
are uniform: they hold for all possible target distributions. When the hypothesis class 
is not convex, one cannot attain a uniform fast bound for ERM (Mendelson, 2008a), and 
it is not known whether fast rates are possible for any algorithm at all; however, one can 
obtain a non-uniform bound (Mendelson and Williamson, 2002; Mendelson, 2008b). Such 
bounds are necessarily dependent upon the relationships between the components 
$(\ell, \mathcal{P}, \mathcal{F})$ of a statistical decision problem or learning task. Here $\ell$ is the loss, 
$\mathcal{F}$ the hypothesis class, and $\mathcal{P}$ the (possibly singleton) class of distributions which, by 
assumption, contains the unknown data-generating distribution. Often one can assume large 
classes $\mathcal{P}$ and still obtain bounds that are relatively uniform, i.e. uniform over all 
$P \in \mathcal{P}$. We identify a central condition on decision problems $(\ell, \mathcal{P}, \mathcal{F})$, where $\ell$ may 
be unbounded, that, in its strongest form, allows $O(1/n)$ rates for so-called ‘proper’ learning 
algorithms that always output a member of $\mathcal{F}$. In weaker forms, it allows rates in between 
$O(1/\sqrt{n})$ and $O(1/n)$. 

1. Throughout this work, implicit in our statements about rates is that the function class is not too large; 
we assume classes with at most logarithmic universal metric entropy, which includes finite classes, VC 
classes, and VC-type classes. 

As a second contribution, we connect the above line of work (within the traditional 
stochastic setting) to a parallel development in the worst-case online sequence prediction 
setting. There, one makes no probabilistic assumptions at all, and one measures convergence 
of the regret, that is, the difference between the cumulative loss attained by a given 
algorithm on a particular sequence and the best possible loss attainable on that sequence 
(Cesa-Bianchi and Lugosi, 2006). This work, due in large part to Vovk (1990, 1998, 2001), 
shares one aspect of Vapnik and Chervonenkis’ approach: it achieves a characterization 
of when fast learning is possible in the online individual-sequence setting. Since there is 
no $\mathcal{P}$ in this setting, the characterization depends only upon the loss $\ell$, and in particular 
on whether the loss is mixable. As shown in Section 4, our second key condition, stochastic 
mixability, is a generalization of Vovk’s earlier notion. Briefly, when $\mathcal{P}$ is the set of all 
distributions on a domain, stochastic mixability is equivalent to Vovk’s classical mixability. 
Stochastic mixability of $(\ell, \mathcal{P}, \mathcal{F})$ for general $\mathcal{P}$ then indicates that fast rates are 
possible in a stochastic online setting, in the worst case over all $P \in \mathcal{P}$. 

The main contribution of this paper is to show, first, that a range of existing conditions 
for fast rates (such as the Bernstein condition, itself a generalization of the Tsybakov 
condition) are either special cases of our central condition, or special cases of stochastic 
mixability (such as original mixability and (stochastic) exp-concavity); and second, to show 
that under surprisingly weak conditions the central condition and stochastic mixability are 
in fact equivalent. Thus there emerges essentially a single condition that implies fast rates 
in a wide variety of situations. Our central and stochastic mixability conditions improve in 
several ways on the existing conditions that they generalize and unify. For example, like 
the uniform convergence condition in Vapnik and Chervonenkis’ original ‘key theorem of 
learning theory’ (Vapnik and Chervonenkis, 1991), but unlike the Bernstein fast rate 
condition, our conditions are one-sided which, as forcefully argued by Mendelson (2014), seems 
as it should be; Example 5.7 explains and illustrates the difference between the two- and 
one-sided conditions. Like Vapnik and Chervonenkis’ uniform convergence condition and 
Vovk’s classical mixability, but unlike the stochastic and individual-sequence exp-concavity 
conditions, our conditions are parametrization independent (Section 4.2.2). Finally, unlike 
the assumptions for classical mixability (Vovk, 1998), we do not require compactness of 
the loss function’s domain. We hasten to add though that for unbounded losses, several 
important issues are still unresolved. For example, if under some $P \in \mathcal{P}$ and with some 
$f \in \mathcal{F}$ the distribution of the loss has polynomial tails, then some of our equivalences break 
down (Section 5.2). 

One final historical precursor deserves mention. Statistical convergence bounds rely on 
bounds on the tails of certain random variables. In Section 7 we show how, for bounded 
losses, the central condition (4) directly controls the behaviour of the cumulant generating 
function of the excess loss random variable. The geometric insight behind this result, 
illustrated in Figure 3, was previously used by Claude Shannon (1956), unbeknownst to us 
when carrying out the work originally (Mehta and Williamson, 2014). It is fitting that our 
tribute to Alexey Chervonenkis can trace its history to another such giant of the theory of 
information processing. 

1.1 Why Read This Paper? Our Most Important Results 

Below, we highlight the core contributions of this work. A more comprehensive overview is 
in Section 2 and the diagram on page 6, which summarizes all results from the paper. 

• We introduce the $v$-stochastic mixability condition on decision problems (Equation (8), 
Definition 4.1 and 5.9), a strict generalization of Vovk’s classical mixability (Vovk, 
1990, 1998, 2001; van Erven et al., 2012a), exp-concavity (Kivinen and Warmuth, 
1999; Cesa-Bianchi and Lugosi, 2006) and stochastic exp-concavity, a condition 
identified implicitly by Juditsky et al. (2008) and used by e.g. Dalalyan and Tsybakov 
(2012). Here $v$ is a nondecreasing nonnegative function. In the important 
special case that $v \equiv \eta$ is constant, we say that (strong) stochastic mixability holds. 
Proposition 4.5 shows that in that case, with finite $\mathcal{F}$, Vovk’s aggregating algorithm 
for online prediction in combination with an online-to-batch conversion achieves a 
learning rate of $O(1/n)$; if the $v$-condition holds for sublinear $v$ with $v(0) = 0$, 
intermediate rates between $O(1/\sqrt{n})$ and $O(1/n)$ are obtained. These results hold under 
no further conditions at all, in particular for unbounded losses. Interest: since the condition 
is a strict generalization of earlier ones, it shows that we can get fast rates in 
some situations for which this was hitherto unknown. 

• We introduce the $v$-central condition (Equations (4), (5), (6), (10), Definition 3.1 
and 5.3). As we show in Theorem 5.4, for bounded losses and $v$ of the form $v(x) = 
Cx^{\alpha}$, it generalizes the Bernstein condition (Bartlett and Mendelson, 2006), itself 
a generalization of the Tsybakov margin condition (Tsybakov, 2004). If $v \equiv \eta$ is 
constant, we just say that the (strong) central condition holds. In that case, with 
(unbounded) log loss, it generalizes a (typically nameless) condition used to obtain 
fast rates in Bayesian and minimum description length (MDL) density estimation in 
misspecification contexts (Li, 1999; Zhang, 2006a,b; Kleijn and van der Vaart, 2006; 
Grünwald, 2011; Grünwald and van Ommen, 2014). These are all conditions that allow 
for fast rates for proper learning, in which the learning algorithm always outputs an 
element of $\mathcal{F}$. 

(i) For convex $\mathcal{F}$, we prove that the strong $\eta$-central condition and strong 
$\eta$-stochastic mixability are equivalent, under weak conditions (Theorem 4.17 in conjunction 
with Proposition 4.11 and Theorem 3.10 in conjunction with Proposition 4.12). 
Interest: This shows that existing fast rate conditions for $O(1/n)$ rates in online 
learning are related to fast rate conditions for $O(1/n)$ rates for proper learning 
algorithms such as ERM, even though such conditions superficially look very different 
and have very different interpretations: existence of a ‘substitution function’ 
(mixability) vs. the exponential moment of a loss difference constituting a supermartingale 
(central condition). 

(ii) We prove (a) that for bounded losses, the strong central condition always 
implies fast $O(1/n)$ rates for ERM and the $v$-central condition implies intermediate 
rates (Theorem 7.6). The equivalence between $\eta$-mixability and the central condition, 
together with Proposition 4.5 mentioned above, implies (b) that the central condition yields 
fast rates in many more situations, even with unbounded losses. We also show (c) that 
there exist decision problems with unbounded losses in which the central condition 
holds, the Bernstein condition does not hold, and we do get fast rates. Interest: 
first, while fast and intermediate rates under the $v$-central condition with bounded 
loss can also be derived from existing results, our proof is directly in terms of the 
central condition and yields better constants. Second, results (a)-(c) above lead us to 
conjecture that there exists some very weak condition (much weaker than bounded loss) 
such that for sublinear $v$, the $v$-central condition together with this extra condition 
always implies sublinear rates. Establishing such a result is a major goal for future 
work. 

• Under mild conditions, the $v$-central condition is equivalent to a third condition, the 
pseudoprobability convexity (PPC) condition (Equation (7) and Definition 3.2 and 5.3). 
Interest: for the constant $v \equiv \eta$ case ($O(1/n)$ rates), the PPC condition provides a 
clear geometric and a data-compression interpretation of the $v$-central condition. For 
bounded losses and general $v$, it implies that a problem must have unique minimizers 
in a certain sense (Proposition 5.11), giving further insight into the fast rates 
phenomenon. 

• In some cases with nonconvex $\mathcal{F}$, ERM and other proper learning algorithms achieve a 
suboptimal $O(1/\sqrt{n})$ rate, whereas online methods combined with an online-to-batch 
conversion get $O(1/n)$ rates in expectation (Audibert, 2007). Now the implication 
‘strong stochastic mixability $\Rightarrow$ strong central condition’ (Theorem 3.10 in conjunction 
with Proposition 4.12, already mentioned under (i) above) holds whenever the risk 
minimizer within $\mathcal{F}$ coincides with the risk minimizer within the convex hull of $\mathcal{F}$. 
Thus, as long as this is the case, there is no inherent rate advantage in improper 
learning: if $\eta$-stochastic mixability holds so that (improper) online methods achieve 
an $O(1/n)$ rate, so will the (proper) ERM method. Theorem 7.6 implies this for 
bounded losses; we conjecture that the same holds for unbounded losses. Interest: 
This insight helps understand when improper learning can and cannot be helpful for 
general losses, something that was hitherto only well-understood for the squared loss 
on a bounded domain (Lecué, 2011). 


2. Introduction to and Overview of Results 

To facilitate reading of this long paper, we provide an introductory summary of all our 
results. By reading this section alongside the ‘map’ of conditions and their relationships 
on page 6, the reader should get a good overview of our results. We start below with some 
notational and conceptual preliminaries, and continue in Section 2.2 with a discussion of 
the central condition, followed by a section-by-section description of the paper. 

2.1 Decision problems and Risk 

We consider decision problems which, in their most general form, can be specified as a 
four-tuple $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ where $\mathcal{P}$ is a set of distributions on a sample space 
$\mathcal{Z}$, and the goal is to make decisions that are essentially as good as the best decision in 
the model $\mathcal{F}$ ($\mathcal{F}$ is often called a ‘hypothesis space’ in machine learning). We will allow 
the decision maker to make decisions in a decision set $\mathcal{F}_d$ which is usually taken equal to, 
or a superset of, $\mathcal{F}$ but for mathematical convenience is also allowed to be a subset of 
$\mathcal{F}$. The quality of decisions will be measured by a loss function 
$\ell : \mathcal{F}_\ell \times \mathcal{Z} \to [-B, \infty]$ for arbitrary $B \geq 0$, 
where a smaller loss means better predictions, and $\mathcal{F}_\ell \supseteq \mathcal{F} \cup \mathcal{F}_d$ is the domain 
of the loss. As further notation we introduce the component functions $\ell_f(z) = \ell(f, z)$, and 
for any set $\mathcal{G}$ we let $\Delta(\mathcal{G})$ denote the set of distributions on $\mathcal{G}$ (implicitly assuming 
that $\mathcal{G}$ is a measurable set, equipped with an appropriate $\sigma$-algebra). A loss function $\ell$ 
is called bounded if for some $B \geq 0$, for all $f \in \mathcal{F}_\ell$ and all $P \in \mathcal{P}$, we have 
$|\ell_f(Z)| \leq B$ almost surely when $Z \sim P$. When $\mathcal{F}_\ell$ is a set for which this is 
well-defined, for any $\mathcal{F} \subseteq \mathcal{F}_\ell$ we denote by $\mathrm{co}(\mathcal{F}) \subseteq \mathcal{F}_\ell$ the convex hull of $\mathcal{F}$. 

Now fix some decision problem $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$. The risk of a predictor 
$f \in \mathcal{F}_\ell$ with respect to $P \in \mathcal{P}$ is defined, as usual, as 

$$R(P, f) = \mathbb{E}_{Z \sim P}[\ell_f(Z)], \qquad (1)$$

where $Z$ is a random variable mapping to outcomes in $\mathcal{Z}$ and, in general, $R(P, f)$ may be 
infinite. However, for the remainder of the paper we will only consider tuples 
$(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ such that for all $P \in \mathcal{P}$, there exists² at least one 
$f^\circ \in \mathcal{F}$ with $R(P, f^\circ) < \infty$ and hence $P(\ell_{f^\circ}(Z) = \infty) = 0$. A learning 
algorithm or estimator is a (computable) function from $\bigcup_{n \geq 0} \mathcal{Z}^n$ to $\mathcal{F}_d$ that, 
upon observing data $Z_1, \ldots, Z_n$, outputs some $\hat{f}_n \in \mathcal{F}_d$. Following 
standard terminology, we call a learning algorithm proper (Lee et al., 1996; Alekhnovich 
et al., 2004; Urner and Ben-David, 2014) if its outputs are restricted to the set $\mathcal{F}$, i.e. 
$\mathcal{F} = \mathcal{F}_d$. Examples of this setting, which has also been called in-model estimation 
(Grünwald and van Ommen, 2014), include ERM and Bayesian maximum a posteriori (MAP) density 
estimation. For notational convenience, in such cases we identify a decision problem with the 
triple $(\ell, \mathcal{P}, \mathcal{F})$. We only consider $\mathcal{F} \neq \mathcal{F}_d$ in Sections 4 and 6 on online 
learning, where $\mathcal{F}_d$ is often taken to be $\mathrm{co}(\mathcal{F})$; for example, $\mathcal{F}$ may be a set of 
probability densities (Example 2.2) and the algorithm may be Bayesian prediction, which predicts 
with the Bayes predictive distribution (Section 3.3), a mixture of elements of $\mathcal{F}$ which is 
hence in $\mathrm{co}(\mathcal{F})$. One of our main insights, discussed in Section 4.3.3, is understanding 
when the weaker conditions that allow fast rates for improper learning transfer to the proper 
learning setting. In the stochastic setting, the rate (in expectation) of a learning algorithm 
is the quantity 


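$$\sup_{P \in \mathcal{P}} \left\{ \mathbb{E}_{\mathbf{Z} \sim P}\left[ R(P, \hat{f}_n) \right] - \inf_{f \in \mathcal{F}} R(P, f) \right\}, \qquad (2)$$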


where $\mathbf{Z} = (Z_1, \ldots, Z_n)$ are $n$ i.i.d. copies of $Z$. The rate of a learning algorithm 
can usually be bounded, up to $\log n$ factors, as $(\mathrm{COMP}_n(\mathcal{F})/n)^{\alpha}$ for some $\alpha$ 
between 1/2 and 1. Here $\mathrm{COMP}_n(\mathcal{F})$ is some measure of the complexity of $\mathcal{F}$ which may or 
may not depend on $n$, such as its codelength, its VC-dimension in classification, an upper 
bound on the KL-divergence between prior and posterior in PAC-Bayesian approaches, or the 
logarithm of the number of elements of an $\epsilon$-net, with $\epsilon$ determined by sample size, 
and so on. In the simplest case, with $\mathcal{F}$ finite, complexity is invariably bounded 
independently of $n$ (usually as $\log |\mathcal{F}|$), and whenever for a decision problem 
$(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ with finite $\mathcal{F}$ there exists a learning algorithm 
achieving the rate $O(1/n)$, we say that the problem allows for fast rates. 

2. We allow the loss itself to be infinite, which makes random variables and their expectations undefined 
when they evaluate to $\infty - \infty$ with positive probability. The requirement that $f^\circ$ exists for all $P$ 
ensures that we never encounter this situation in any of our formulas. 

In the remainder of this section we make the following simplifying assumption. 

Assumption A (Minimal Risk Achieved) For all $P \in \mathcal{P}$, the minimal risk $R(P, f)$ 
over $\mathcal{F}$ is achieved by some $f^* \in \mathcal{F}$ depending on $P$, i.e. 

$$R(P, f^*) = \inf_{f \in \mathcal{F}} R(P, f). \qquad (3)$$

Assumption A is essentially a closure property that holds in many cases of interest. We 
will call such $f^*$ $\mathcal{F}$-optimal for $P$ or simply $\mathcal{F}$-optimal. When $P \in \mathcal{P}$ and $\mathcal{F}$ are 
clear from context, we will also simply say that $f^*$ is the best predictor. 

Example 2.1 (Regression, Classification, (Relatively) Well-Specified and Misspecified 
Models) In the standard statistical learning problems of classification and regression, 
we have $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ for some ‘feature’ or ‘covariate’ space $\mathcal{X}$ and $\mathcal{F}$ is a 
set of functions from $\mathcal{X}$ to $\mathcal{Y}$. In classification, $\mathcal{Y} = \{0, 1\}$ and one usually takes 
the standard classification loss $\ell_f((x, y)) = |y - f(x)|$; in regression, one takes 
$\mathcal{Y} = \mathbb{R}$ and the squared error loss $\ell_f((x, y)) = \frac{1}{2}(y - f(x))^2$. In Example 2.2 we 
show that density estimation also fits in our setting. For losses with bounded range $[0, B]$, 
if the optimal $f^*$ that exists by Assumption A has 0 risk, we are in what Vapnik and 
Chervonenkis (1974) call the ‘optimistic’ setting, more commonly known as the ‘deterministic’ 
or ‘realizable’ case (VC in Figure 1 on page 6). We never make this strong an assumption and 
are thus always in the ‘agnostic’ case. A strictly weaker assumption would be to assume that 
$f^*$ is the Bayes decision rule, minimizing the risk $R(P, f)$ over the loss function’s full 
domain $\mathcal{F}_\ell$; in classification this means that $f^*$ is the Bayes classifier (minimizing 
risk over all functions from $\mathcal{X}$ to $\mathcal{Y}$), in regression it implies that $f^*$ is the true 
regression function, i.e. $f^*(x) = \mathbb{E}_{(X,Y) \sim P}[Y \mid X = x]$, and in density estimation 
(see below) that $f^*$ is the density of the ‘true’ $P$. Borrowing terminology from statistics, 
we then say that the model $\mathcal{F}$ is well-specified, or simply correct. 
Although this assumption is often made in statistics and sometimes in statistical learning 
(e.g. in the original Tsybakov condition (Tsybakov, 2004) and in the analysis of strictly 
convex surrogate loss functions for 0/1-loss (Bartlett et al., 2006)), all of our results are 
applicable to incorrect, misspecified $\mathcal{F}$ as well. We will, however, in some cases make the 
much weaker Assumption B (page 28) that $\mathcal{F}$ is well-specified relative to $\mathcal{F}_d$, or 
equivalently $\mathcal{F}$ is as good as $\mathcal{F}_d$, meaning that for all $P \in \mathcal{P}$, 
$\min_{f \in \mathcal{F}_d} R(P, f) = \min_{f \in \mathcal{F}} R(P, f)$. In all our examples, if 
$\mathcal{F} \neq \mathcal{F}_d$ we can take, without loss of generality, $\mathcal{F}_d = \mathrm{co}(\mathcal{F})$, and then a 
sufficient (but by no means necessary) condition for relative well-specification is that 
$\mathcal{F}$ is either convex or correct. ■ 


We now turn to an overview of the main results and concepts of this paper, which are also 
highlighted in Figure 1 on page 6. 

2.2 Main Concept: The Central Condition 

We focus on decision problems $(\ell, \mathcal{P}, \mathcal{F})$ satisfying the simplifying Assumption A by 
fixing any such decision problem and letting $P \in \mathcal{P}$ and $f^*$ be $\mathcal{F}$-optimal for $P$. 
We may now ask this $f^*$ to satisfy a stronger, supermartingale-type property where for some 
$\eta > 0$ we require 

$$\mathbb{E}_{Z \sim P}\left[ e^{\eta(\ell_{f^*}(Z) - \ell_f(Z))} \right] \leq 1 \quad \text{for all } f \in \mathcal{F}. \qquad (4)$$

This type of property plays a fundamental role in the study of fast rates because it controls 
the higher moments of the negated excess loss $\ell_{f^*}(Z) - \ell_f(Z)$. Note that by our conventions 
regarding infinities (Section 2.1) this implies that $P(\ell_{f^*}(Z) = \infty) = 0$. 

There are several motivations for studying the requirement in (4). In the case of 
classification loss, it can be seen to be a special, extreme case of the Bernstein condition (see 
below). In the case of log loss, the requirement becomes a standard (but usually unnamed) 
condition, which we call the Bayes-MDL condition, that is used in proving convergence 
rates of Bayesian and MDL density estimation (Example 2.2). Finally, under a bounded 
loss assumption the condition (4) implies one of our main results, Theorem 7.6, a fast rates 
result for statistical learning over finite classes (the situation for unbounded losses is more 
complicated and is discussed after Example 2.2). 

Note that to satisfy Assumption A it is sufficient to require that the property (4) holds 
for some $f^* \in \mathcal{F}$ since, by Jensen’s inequality, this $f^*$ must then automatically be 
$\mathcal{F}$-optimal as in (3). We will require (4) to hold for all $P \in \mathcal{P}$ (where $f^*$ may depend 
on $P$). This is the simplest form of our central condition, which we call the $\eta$-central 
condition. We note that if (4) holds for all $f \in \mathcal{F}$ then it must also hold in expectation 
for all distributions on $\mathcal{F}$. Thus, the $\eta$-central condition can be restated as follows: 


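$$\forall P \in \mathcal{P} \;\, \exists f^* \in \mathcal{F} \;\, \forall \Pi \in \Delta(\mathcal{F}) : \quad \mathbb{E}_{Z \sim P}\, \mathbb{E}_{f \sim \Pi}\left[ e^{\eta(\ell_{f^*}(Z) - \ell_f(Z))} \right] \leq 1. \qquad (5)$$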


This rephrasing of the central condition will be useful when comparing it to conditions 
introduced later in the paper. 
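
For a finite class and a distribution with finite support, condition (4) can be checked by direct enumeration. The sketch below is only illustrative: the function names and the toy 0-1 loss problem in the usage note are assumptions rather than constructions from the paper, and all losses are assumed finite.

```python
import math
from typing import List, Sequence, Tuple


def central_lhs(probs: Sequence[float], loss_star: Sequence[float],
                loss_f: Sequence[float], eta: float) -> float:
    """Left-hand side of (4): E_{Z~P}[exp(eta * (loss_{f*}(Z) - loss_f(Z)))]."""
    return sum(p * math.exp(eta * (ls - lf))
               for p, ls, lf in zip(probs, loss_star, loss_f))


def check_central_condition(probs: Sequence[float],
                            losses: List[Sequence[float]],
                            eta: float, tol: float = 1e-12) -> Tuple[bool, int]:
    """Check (4) for one fixed P on a finite sample space.

    probs[z] is P(Z = z); losses[k][z] is the (finite) loss of hypothesis k
    on outcome z. The candidate f* is taken to be a risk minimizer, which is
    the only possible choice by the Jensen argument in the text.
    Returns (condition holds, index of f*).
    """
    risks = [sum(p * l for p, l in zip(probs, lf)) for lf in losses]
    star = min(range(len(losses)), key=lambda k: risks[k])
    holds = all(central_lhs(probs, losses[star], lf, eta) <= 1.0 + tol
                for lf in losses)
    return holds, star
```

As a usage example, take $Z \in \{0, 1\}$ with $P(Z = 1) = 0.8$ and two constant classifiers under 0-1 loss, i.e. `probs = [0.2, 0.8]` and `losses = [[0, 1], [1, 0]]`. Solving $0.2e^{\eta} + 0.8e^{-\eta} \leq 1$ by hand shows that the check should succeed exactly for $\eta \leq \ln 4$. With `probs = [0.5, 0.5]` both classifiers are risk minimizers and the left-hand side equals $\cosh(\eta) > 1$, so the check fails for every $\eta > 0$.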

The central condition is easiest to interpret for density estimation with the logarithmic 
loss. In this case the condition for $\eta = 1$ is implied by $\mathcal{F}$ being either well-specified or 
convex, as the following example shows. 


Example 2.2 (Density estimation under well-specified or convex models) Let $\mathcal{F}$ 
be a set of probability densities on $\mathcal{Z}$ and take $\ell$ to be log loss, so that 
$\ell_f(z) = -\log f(z)$. For log loss, statistical learning becomes equivalent to density 
estimation. Satisfying the central condition then becomes equivalent to, for all 
$P \in \mathcal{P}$, finding an $f^* \in \mathcal{F}$ such that 

$$\mathbb{E}_{Z \sim P}\left[ \left( \frac{f(Z)}{f^*(Z)} \right)^{\eta} \right] \leq 1 \qquad (6)$$

for all $f \in \mathcal{F}$. If the model $\mathcal{F}$ is correct, it trivially holds that 
$(\ell, \mathcal{P}, \mathcal{F})$ satisfies the 1-central condition as we choose $f^*$ to be the density of $P$, 
so that the densities in the expectation and the denominator cancel. Even when the model is 
misspecified, Li (1999) showed that (6) holds for $\eta = 1$ provided the model is convex. We 
will recover this result in Example 3.12 in Section 3, where we review the central role that 
(6) plays in convergence proofs of MDL and Bayesian estimation. Even if the set of densities 
is neither correct nor convex, the central condition often still holds for some $\eta \neq 1$. In 
Example 3.6 we explore this for the set of normal densities with variance $\sigma^2$ when the 
true distribution is either Gaussian with a different variance, or subgaussian. 
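
The cancellation argument for a correct model can be checked numerically on a finite sample space, where densities are probability mass functions. This is a minimal sketch under that assumption; `bayes_mdl_lhs` and the example mass functions are hypothetical names and values chosen for illustration.

```python
from typing import Sequence


def bayes_mdl_lhs(p: Sequence[float], f_star: Sequence[float],
                  f: Sequence[float], eta: float = 1.0) -> float:
    """Left-hand side of (6): E_{Z~P}[(f(Z) / f*(Z))**eta] for mass functions.

    Assumes f_star[z] > 0 wherever p[z] > 0, so the ratio is well defined.
    Outcomes with p[z] == 0 do not contribute to the expectation.
    """
    return sum(pz * (fz / fs) ** eta
               for pz, fz, fs in zip(p, f, f_star) if pz > 0)


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]   # data-generating distribution
    f = [0.2, 0.2, 0.6]   # some other element of the model
    # Well-specified case: f* = p, so p cancels and the sum is that of f.
    print(bayes_mdl_lhs(p, p, f, eta=1.0))
    # A smaller eta can only decrease the value here, by Jensen's inequality.
    print(bayes_mdl_lhs(p, p, f, eta=0.5))
    # Choosing the wrong comparator (f* = f, competitor p) violates (6).
    print(bayes_mdl_lhs(p, f, p, eta=1.0))
```

The third call illustrates why $f^*$ has to be the risk minimizer: swapping the roles of the two mass functions pushes the expectation above 1.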


We show in Section 7 that for bounded losses the $\eta$-central condition implies fast $O(1/n)$ 
rates for finite $\mathcal{F}$. But what about unbounded losses such as log loss? In the log 
loss/density estimation case, as shown by Barron and Cover (1991); Zhang (2006a); Grünwald 
(2007) and others, fast rates can be obtained in a weaker sense. Specifically, in the worst 
case over $P \in \mathcal{P}$, the squared Hellinger distance or Rényi divergences between 
$\hat{f}_n$ and the optimal $f^*$ converge as $O(1/n)$ for ERM when $\mathcal{F}$ is finite, and like 
$O(\mathrm{COMP}_n/n)$ for general $\mathcal{F}$ and for 2-part MDL and Bayes MAP-style algorithms. If the 
goal is to obtain fast rates in the stronger sense (2) for general unbounded loss functions, 
some additional assumptions are needed. Zhang (2006a,b) provides such results for penalized 
ERM and randomized estimators (see also the discussion in Section 8). Importantly, as 
explained by Grünwald (2012), the proofs for fast rates in all the works mentioned here 
crucially, though sometimes implicitly, employ the $\eta$-central condition at some point. 


2.3 Overview of the Paper 

Section 3 — Fast Rates for Proper Learning: PPC Condition, Bayesian Interpretation, 
Relation to Bayes-MDL Condition. 

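In Section 3, we give a second condition, the pseudoprobability convexity (PPC) condition, 
a variation of (5) stating that: 

$$\forall P \in \mathcal{P} \;\, \forall \Pi \in \Delta(\mathcal{F}) \;\, \exists f^* \in \mathcal{F} : \quad \mathbb{E}_{Z \sim P}[\ell_{f^*}(Z)] \leq \mathbb{E}_{Z \sim P}\left[ -\frac{1}{\eta} \log \mathbb{E}_{f \sim \Pi}\, e^{-\eta \ell_f(Z)} \right]. \qquad (7)$$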

Clearly, if the condition holds, then it will hold by choosing, for every $P \in \mathcal{P}$, $f^*$ to 
be $\mathcal{F}$-optimal relative to $P$. The name ‘pseudoprobability’ stems from the interpretation of 
$p_f(Z) := e^{-\eta \ell_f(Z)}$ as a ‘pseudo-probability’ associated with $f$, similar to the 
‘entropification’ of $f$ introduced by Grünwald (1999). The full name ‘pseudoprobability 
convexity’ stems from the interpretation illustrated by and explained around Figure 2 on 
page 21. We show that, under simplifying Assumption A, the central and PPC conditions are 
equivalent. One direction of this equivalence is trivial, while the other direction is our 
first main result, Theorem 3.10. 
We also explain how the rightmost expression in (7) strongly resembles the expected log loss 
of a Bayes predictive distribution, and how this leads to a ‘pseudo-Bayesian’ or ‘pseudo-data 
compression’ interpretation of the pseudoprobability convexity condition, and hence of 
the central condition. Versions of this interpretation were highlighted earlier by Grünwald 
(2012); Grünwald and van Ommen (2014). Thus, we can think of both conditions as a single 
condition with dual interpretations: a frequentist one in terms of exponentially small 
deviation probabilities (which follow by applying Markov’s inequality to 
$e^{\eta(\ell_{f^*}(Z) - \ell_f(Z))}$) 
and a pseudo-Bayesian one in terms of convexity properties of $\mathcal{F}$. Further, we give a few 
more examples of the central/PPC condition in this section, and we discuss in detail its 
special case, the Bayes-MDL condition (Example 2.2). 
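
The step from the central condition (5) to the PPC condition (7) can be sketched with Jensen's inequality for the concave logarithm; this is presumably the direction the text calls trivial. The derivation below is a reconstruction under the assumption that all expectations are well defined, with $P$ and $\Pi$ fixed and $f^*$ the element supplied by (5).

```latex
\begin{align*}
0 &\geq \log \mathbb{E}_{Z \sim P}\Big[ e^{\eta \ell_{f^*}(Z)} \,
        \mathbb{E}_{f \sim \Pi}\, e^{-\eta \ell_f(Z)} \Big]
   && \text{(take logarithms in (5))} \\
  &\geq \mathbb{E}_{Z \sim P}\Big[ \eta\, \ell_{f^*}(Z)
        + \log \mathbb{E}_{f \sim \Pi}\, e^{-\eta \ell_f(Z)} \Big]
   && \text{(Jensen's inequality)} \\
\Longrightarrow \quad
\mathbb{E}_{Z \sim P}\big[ \ell_{f^*}(Z) \big]
  &\leq \mathbb{E}_{Z \sim P}\Big[ -\tfrac{1}{\eta}
        \log \mathbb{E}_{f \sim \Pi}\, e^{-\eta \ell_f(Z)} \Big]
   && \text{(rearrange, divide by } \eta > 0\text{)}
\end{align*}
```

The last line is (7) with the same $f^*$ for every $\Pi$, which is slightly stronger than what (7) asks for, since there $f^*$ is allowed to depend on $\Pi$.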

Crucially, all algorithms that we are aware of for which fast rates have been proven 
by means of the $\eta$-central condition are ‘proper’ in that they always output a (possibly 
randomized) element of $\mathcal{F}$ itself. This includes ERM, two-part MDL, Bayes MAP and 
randomized Bayes algorithms (Barron and Cover, 1991; Zhang, 2006a,b; Grünwald, 2007) 
and PAC-Bayesian methods (Audibert, 2004; Catoni, 2007). Thus, the central condition is 
appropriate for proper learning. This is in contrast to the stochastic mixability condition 
which is defined and studied in Section 4. 


Section 4 — Fast Rates for Online Learning: (Stochastic) Mixability and Exp-Concavity. 


In online learning with bounded losses, strong convexity of the loss is an oft-used condition
to obtain fast rates because it is naturally related to gradient and mirror descent methods
(Hazan et al., 2007, 2008; Shalev-Shwartz and Singer, 2007). If we allow more general
algorithms, however, then fast rates are also possible under the condition of exp-concavity,
which is weaker than strong convexity (Hazan et al., 2007). Exp-concavity in turn is a special
case of Vovk’s classical mixability condition (Vovk, 2001), the main difference being that the
definition of exp-concavity depends on the choice of parametrization of the loss function whereas
the definition of classical mixability does not. Whether classical mixability can really be
strictly weaker than exp-concavity in an ‘optimal’ parametrization is an open question
(Kamalaruban et al., 2015; van Erven, 2012). Strong convexity, exp-concavity and classical
mixability are all individual sequence notions, allowing for fast rates in the sense that, if
$\mathcal{F}$ is finite, then there exist (improper) learning algorithms for which the worst-case
cumulative regret over all sequences, that is

$$\sup_{z_1, \ldots, z_n} \left[ \sum_{i=1}^{n} \ell_{f_i}(z_i) - \min_{f \in \mathcal{F}} \sum_{i=1}^{n} \ell_f(z_i) \right],$$

with $f_i$ the algorithm's prediction in round $i$, is bounded by a constant. This implies that
the worst-case cumulative regret per outcome at time $n$ is $O(1/n)$.
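As a minimal sketch of such a constant regret bound, the following code runs the exponentially weighted average forecaster (a simpler relative of the Aggregating Algorithm, chosen here for illustration) on a finite set of experts under the squared loss $(y - a)^2$ with $y, a \in [0, 1]$. This loss is exp-concave for $\eta \le 1/2$, and the standard bound for this forecaster is a cumulative regret of at most $\ln(N)/\eta$ on every sequence. The function name, the random data and the parameter values are assumptions made for this example only.

```python
import math
import random

def exp_weights_regret(expert_preds, outcomes, eta=0.5):
    """Exponentially weighted average forecaster under squared loss on [0, 1].

    expert_preds[t][k] is expert k's prediction in round t, outcomes[t] the outcome.
    Returns the cumulative regret with respect to the best fixed expert.
    """
    n_experts = len(expert_preds[0])
    cum_loss = [0.0] * n_experts          # cumulative loss of each expert so far
    learner_loss = 0.0
    for preds, y in zip(expert_preds, outcomes):
        # Weights proportional to exp(-eta * cumulative loss); shifting by the
        # minimum leaves the normalized weights unchanged and avoids underflow.
        base = min(cum_loss)
        weights = [math.exp(-eta * (c - base)) for c in cum_loss]
        total = sum(weights)
        # Improper prediction: a weighted mean that need not equal any expert.
        prediction = sum(w * a for w, a in zip(weights, preds)) / total
        learner_loss += (y - prediction) ** 2
        for k, a in enumerate(preds):
            cum_loss[k] += (y - a) ** 2
    return learner_loss - min(cum_loss)

rng = random.Random(0)
n_rounds, n_experts, eta = 2000, 5, 0.5
expert_preds = [[rng.random() for _ in range(n_experts)] for _ in range(n_rounds)]
outcomes = [rng.random() for _ in range(n_rounds)]
regret = exp_weights_regret(expert_preds, outcomes, eta)
bound = math.log(n_experts) / eta         # constant in n, so regret per round is O(1/n)
```

The bound does not depend on the number of rounds, which is exactly what makes the per-outcome regret decay at rate $O(1/n)$.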

One may obtain learning algorithms for statistical learning by converting algorithms for
online learning using a process called online-to-batch conversion (Cesa-Bianchi et al., 2004;
Barron, 1987; Yang and Barron, 1999). This process preserves rates, in the sense that if
the worst-case regret per outcome at time $n$ of a method is $r_n$, then the rate of the resulting
learning algorithm in the sense of (2) will also be $r_n$. However, for this purpose, it suffices to
use a much weaker stochastic analogue of mixability that only holds in expectation instead
of holding for all outcomes. This analogue is $\eta$-stochastic mixability, which we define (note
the similarity to (7)) as

$$\forall \Pi \in \Delta(\mathcal{F}) \;\; \exists f^* \in \mathcal{F}_d \;\; \forall P \in \mathcal{P}: \quad \mathbb{E}_{Z \sim P}\left[\ell_{f^*}(Z)\right] \le \mathbb{E}_{Z \sim P}\left[-\frac{1}{\eta} \log \mathbb{E}_{f \sim \Pi}\left[e^{-\eta \ell_f(Z)}\right]\right]. \tag{8}$$


Under this condition, Vovk’s Aggregating Algorithm (AA) achieves fast rates in expectation
under any $P \in \mathcal{P}$ in sequential on-line prediction, without any further conditions on
$(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$; in particular there are no boundedness restrictions on the loss. If we take
$\mathcal{P}$ to be the set of all distributions on $\mathcal{Z}$, we recover Vovk’s original individual-sequence
$\eta$-mixability. Note that, based on data $Z_1, \ldots, Z_n$, the AA outputs $f$ that are not necessarily
in $\mathcal{F}$ but can be in some different set $\mathcal{F}_d$ (in all applications we are aware of,
$\mathcal{F}_d = \mathrm{co}(\mathcal{F})$, the convex hull of $\mathcal{F}$). Online-to-batch conversion has been used, amongst
others, by Juditsky et al. (2008); Dalalyan and Tsybakov (2012) and Audibert (2009) to obtain
fast rates in model selection aggregation. In Sections 4.2.3 and 4.2.4 we relate their conditions
to stochastic mixability. We show that results by Juditsky et al. (2008) employ a stochastic
exp-concavity condition, a special case of our stochastic mixability condition, in a manner
similar to the way exp-concavity is a special case of classical mixability. Given these
applications to statistical learning, it is not surprising that stochastic mixability is closely
related to the conditions for statistical learning discussed above. We will show in Proposition 4.12
that under certain assumptions it is equivalent to our central condition (5) and hence also
the PPC condition (7). The proposition shows that this holds unconditionally in the proper
learning setting: stochastic mixability implies the pseudoprobability convexity condition
which, in turn, implies the central condition under some weak restrictions. The proposition
also gives a condition under which these relationships continue to hold in the more
challenging case when $\mathcal{F} \neq \mathcal{F}_d$. In general, making predictions in $\mathcal{F}_d$ gives more power,
and the central condition can only be used to infer fast rates for proper learning algorithms
which always play in $\mathcal{F}$. Thus, if $\eta$-stochastic mixability for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ implies $\eta$-PPC
for $(\ell, \mathcal{P}, \mathcal{F})$, then there is no rate improvement for learning algorithms that are allowed to
predict in $\mathcal{F}_d$ instead of $\mathcal{F}$. Proposition 4.12 gives a central insight of this paper by showing
that this implication holds under Assumption B: $\eta$-stochastic mixability for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$
implies the $\eta$-PPC and $\eta$-central conditions for $(\ell, \mathcal{P}, \mathcal{F})$ whenever $\mathcal{F}$ is well-specified
relative to $\mathcal{F}_d$. Relative well-specification was defined in Example 2.1, where we indicated
that this is a much weaker condition than mere correctness of $\mathcal{F}$; in all cases we are aware of,
a sufficient condition is that $\mathcal{F}$ is convex. In Example 4.13 we explore the implications of
Proposition 4.12 for the question whether fast rates can be obtained both in expectation
and in probability (as is the case under the central condition) or only in expectation
(as is sometimes the case under stochastic mixability).

For the implication from the central condition to stochastic mixability, we first define
an intermediate, slightly stronger generalization of classical mixability that we call the
$\eta$-predictor condition, which looks like the central condition, but with its universal quantifiers
interchanged:

$$\forall \Pi \in \Delta(\mathcal{F}) \;\; \exists f^* \in \mathcal{F}_d \;\; \forall P \in \mathcal{P}: \quad \mathbb{E}_{Z \sim P}\, \mathbb{E}_{f \sim \Pi}\left[e^{\eta(\ell_{f^*}(Z) - \ell_f(Z))}\right] \le 1. \tag{9}$$

In our second main result, Theorem 4.17, we show that the central condition implies the
predictor condition whenever the decision problem satisfies a certain minimax identity, which
holds under Assumption C or its weakening Assumption D. And since (by a trivial application
of Jensen’s inequality) the predictor condition in turn implies stochastic mixability,
we come full circle and see that, under some restrictions, all four of our conditions in the
‘central quadrangle’ of Figure 1 (page 6) are really equivalent.


Section 5 — Intermediate Rates: Weakening to $v$-central condition, connection to
Bernstein and Tsybakov Conditions — can be read independently from Section 4.

In Section 5, we weaken the $\eta$-central condition to a condition which we call the $v$-central
condition: rather than requiring that a fixed $\eta$ exists such that (4) holds, we only require
that it holds (for all $P \in \mathcal{P}$) up to some ‘slack’ $\epsilon$, where we require that the slack must go
to 0 as $\eta \downarrow 0$. Specifically, we require that there is some increasing nonnegative function $v$
such that

$$\mathbb{E}_{Z \sim P}\left[e^{\eta(\ell_{f^*}(Z) - \ell_f(Z))}\right] \le e^{\eta \epsilon} \quad \text{for all } f \in \mathcal{F}, \text{ all } \epsilon > 0, \text{ with } \eta := v(\epsilon). \tag{10}$$


As shown in this section (Example 5.5), the $v$-central condition is associated with rates of
order $w(C/n)$ where $C > 0$ is some constant, and $w$ is the inverse of $x \mapsto x v(x)$. Taking
constant $v(x) = \eta$, we see that this generalizes the situation for the $\eta$-central condition,
which for fixed $\eta$ allows rates of order $O(1/n)$. In our third main result, Theorem 5.4, we
then show that, for bounded loss functions, this condition is equivalent to a generalized
Bernstein condition (see Definition 5.2), which itself is a generalization of the Tsybakov
margin condition (Tsybakov, 2004) to classification settings in which $\mathcal{F}$ may be misspecified,
and to loss functions different from 0/1-loss (Bartlett and Mendelson, 2006). Specifically,
for given function $v$, a decision problem satisfies the $v$-central condition if and only if it
satisfies the $u$-generalized Bernstein condition for a function

$$u(x) \asymp \frac{x}{v(x)}, \tag{11}$$

where for functions $a, b$ from $[0, \infty)$ to $[0, \infty)$, $a(x) \asymp b(x)$ denotes that there exist constants
$c, C > 0$ such that, for all $x \ge 0$, $c\,a(x) \le b(x) \le C\,a(x)$.

Example 2.3 (Classification) Let $(\ell, \mathcal{P}, \mathcal{F})$ represent a classification problem with $\ell$ the
0/1-loss that satisfies the $v$-central condition for $v(x) \asymp x^{1-\beta}$, $0 \le \beta \le 1$. Then (11) holds
with $u$ of form $u(x) = B x^{\beta}$. This is equivalent to the standard $(\beta, B)$-Bernstein condition
(which, if $\mathcal{F}$ is well-specified, corresponds to the Tsybakov margin condition with exponent
$\beta/(1 - \beta)$), which is known to guarantee rates of $O(n^{-1/(2-\beta)})$. This is consistent with the
rate $w(C/n)$ above, since if $v(x) \asymp x^{1-\beta}$, then the inverse $w$ of $x \mapsto x v(x)$ satisfies
$w(x) \asymp x^{1/(2-\beta)}$. ■
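The relation between $v$ and the rate $w(C/n)$ can be checked numerically. The sketch below assumes the exact form $v(x) = x^{1-\beta}$ (rather than equality up to constants), inverts $x \mapsto x v(x)$ by bisection, and compares the result with the closed form $(C/n)^{1/(2-\beta)}$. The function name `rate` and the default $C = 1$ are choices made for this illustration.

```python
def rate(n, beta, C=1.0):
    """Return w(C/n), where w is the inverse of x -> x * v(x) and v(x) = x**(1 - beta)."""
    target = C / n

    def x_times_v(x):
        # x * v(x) = x**(2 - beta), which is increasing on [0, infinity)
        return x * x ** (1.0 - beta)

    lo, hi = 0.0, max(1.0, target)        # x_times_v(hi) >= target by construction
    for _ in range(200):                  # bisection to floating-point precision
        mid = 0.5 * (lo + hi)
        if x_times_v(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

n = 10_000
slow = rate(n, beta=0.0)                  # beta = 0 gives the slow rate n**(-1/2)
fast = rate(n, beta=1.0)                  # beta = 1 gives the fast rate n**(-1)
middle = rate(n, beta=0.5)                # intermediate rate n**(-2/3)
```

With $\beta = 0$ the exponent is $1/2$ and with $\beta = 1$ it is $1$, so the family interpolates between the usual slow and fast rates.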


For the case of unbounded losses, the generalized Bernstein and central conditions are not 
equivalent. Example 5.7 gives a simple case in which the Bernstein condition does not hold 
whereas, due to its one-sidedness, the central condition does hold and fast rates for ERM 
are easy to verify; Example 5.8 shows that the opposite can happen as well. 

In this section we also extend $\eta$-stochastic mixability to $v$-stochastic mixability and
show that another fast-rate condition identified by Juditsky et al. (2008) is a special case.
For unbounded losses, the $v$-stochastic mixability and the $v$-central condition become quite
different, and it may be that the $u$-Bernstein condition does imply $v$-stochastic mixability;
whether this is so is an open problem. Finally, using Theorem 5.4, we characterize the
relationship between the $\eta$-central condition and the existence of unique risk minimizers for
bounded losses.

Section 6 —From Actions to Predictors. 

The classical mixability literature usually considers the unconditional setting where
observations and actions are points from $\mathcal{Z}$ and $\mathcal{A}$, respectively. For example, one may consider
the squared loss with $\ell_a(y) = (y - a)^2$ for $y, a \in [0, 1]$. It is often easy to establish stochastic
mixability for a decision problem in this unconditional setting. An interesting question
is whether this automatically implies that stochastic mixability (and hence, under further
conditions, also the central condition) holds in the corresponding conditional setting where
$\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and the decision set contains predictors $f: \mathcal{X} \to \mathcal{A}$ that map features
$x \in \mathcal{X}$ to actions. Here, an example loss function might be $\ell_f((x, y)) = \frac{1}{2}(y - f(x))^2$ as
considered in Example 2.1. In this section, we show that the answer is a qualified ‘yes’: in general,
the set $\mathcal{F}_d$ may need to be a large set such as $\mathcal{A}^{\mathcal{X}}$, but with some additional assumptions it
remains manageable.

Section 7 — Fast Rate Theorem.


In Section 7, we show how for bounded losses the central condition enables a direct proof
of fast rates in statistical learning over finite classes. The path to our fast rates result,
Theorem 7.6, involves showing that, for each function $f \in \mathcal{F}$, the central condition implies
that the empirical excess loss of $f$ exhibits one-sided concentration at a scale related to
the excess loss of $f$. This one-sided concentration result is achieved by way of the
Cramér-Chernoff method (Boucheron et al., 2013) combined with an upper bound on the cumulant
generating function (CGF) of the negative excess loss of $f$ evaluated at a specific point.
The upper bound on the CGF is given in Theorem 7.3 which shows that if the absolute
value of the excess loss random variable is bounded by 1, its CGF evaluated at some $-\eta < 0$
takes the value 0, and its mean $\mu$ is positive, then the central condition implies that the
CGF evaluated at $-\eta/2$ is upper bounded by a universal constant times a negative quantity
proportional to $\mu$. By way of a careful localization argument, the fast rates result for finite
classes also extends to VC-type classes, as presented in Theorem 7.7.
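The Cramér-Chernoff step can be sketched as follows. The notation is introduced here for illustration and is not taken from Section 7: $X_1, \ldots, X_n$ are i.i.d. copies of the excess loss $X = \ell_f(Z) - \ell_{f^*}(Z)$, and $\Lambda(\lambda) = \log \mathbb{E}[e^{\lambda X}]$ is its CGF. For any $\lambda > 0$, Markov's inequality applied to $e^{-\lambda \sum_i X_i}$ gives

```latex
\Pr\Bigl(\tfrac{1}{n}\sum_{i=1}^{n} X_i \le 0\Bigr)
  = \Pr\Bigl(e^{-\lambda \sum_{i=1}^{n} X_i} \ge 1\Bigr)
  \le \mathbb{E}\Bigl[e^{-\lambda \sum_{i=1}^{n} X_i}\Bigr]
  = e^{\,n \Lambda(-\lambda)} .
```

Taking $\lambda = \eta/2$, any strictly negative upper bound on $\Lambda(-\eta/2)$ makes the probability that $f$ looks empirically at least as good as $f^*$ decay exponentially in $n$.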

Final Section — Discussion. 

The paper ends with a discussion of what has been achieved and a list of open problems. 

3. The Central Condition in General and a Bayesian Interpretation via 
the PPC Condition 

In this section we first generalize the definitions of the central and pseudoprobability
convexity (PPC) conditions beyond the case of the simplifying Assumption A. We give a few
examples and list some of their basic properties. We then show that the central condition
trivially implies the PPC condition, under no conditions on the decision problem at all.
Additionally, in our first main theorem, we show that if Assumption A holds or the loss is
bounded, then the converse result is also true. Importantly, this equivalence between the
central condition and the PPC condition allows us to interpret the PPC condition as the
requirement that a particular set of pseudoprobabilities is convex on the side that ‘faces’ the
data-generating distribution $P$ (Figure 2). This leads to a (pseudo)-Bayesian interpretation,
which says that the (pseudo)-Bayesian predictive distribution is not allowed to be better
than the best element of the model.

3.1 The Central and Pseudoprobability Convexity Conditions in General

We now extend the definition (4) of the central condition to the case that our simplifying
Assumption A may not hold. In such cases, it may be that there is no fixed comparator
that satisfies (4), but there does exist a sequence of comparators $f^*_1, f^*_2, \ldots$ that satisfies
(5) in the limit. By introducing a function $\phi$ that maps $P$ to $f^*$ this leads to the following
definition of the general $\eta$-central condition:


Definition 3.1 (Central Condition) Let $\eta > 0$ and $\epsilon \ge 0$. We say that $(\ell, \mathcal{P}, \mathcal{F})$ satisfies
the $\eta$-central condition up to $\epsilon$ if there exists a comparator selection function $\phi: \mathcal{P} \to \mathcal{F}$
such that

$$\mathbb{E}_{Z \sim P}\, \mathbb{E}_{f \sim \Pi}\left[e^{\eta(\ell_{\phi(P)}(Z) - \ell_f(Z))}\right] \le e^{\eta \epsilon} \quad \text{for all } P \in \mathcal{P} \text{ and distributions } \Pi \in \Delta(\mathcal{F}). \tag{12}$$


If it satisfies the $\eta$-central condition up to 0, we say that the strong $\eta$-central condition or
simply the $\eta$-central condition holds. If it satisfies the $\eta$-central condition up to $\epsilon$ for all
$\epsilon > 0$, we say that the weak $\eta$-central condition holds; this is equivalent to

$$\sup_{P \in \mathcal{P}} \inf_{f^* \in \mathcal{F}} \sup_{\Pi \in \Delta(\mathcal{F})} \mathbb{E}_{Z \sim P}\, \mathbb{E}_{f \sim \Pi}\left[e^{\eta(\ell_{f^*}(Z) - \ell_f(Z))}\right] \le 1. \tag{13}$$

Note that we explicitly identify the situation in which the condition does not actually hold
in the strong sense but will if some slack $\epsilon > 0$ is introduced. We will do the same for
the other fast rate conditions identified in this paper, and we will also establish relations
between the ‘up to $\epsilon > 0$’ versions. This will become useful throughout Section 5 and, in
particular, Section 5.3.

The PPC condition generalizes analogously to the central condition and features

$$m^{\eta}_{\Pi}(z) = -\frac{1}{\eta} \log \mathbb{E}_{f \sim \Pi}\left[e^{-\eta \ell_f(z)}\right], \tag{14}$$

a quantity that plays a crucial role in the analysis of online learning algorithms (Vovk, 1998,
2001), (Cesa-Bianchi and Lugosi, 2006, Theorem 2.2) and has been called the mix loss in
that context by De Rooij et al. (2014).
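To make the mix loss concrete, here is a small sketch for a finite set of predictors with a fixed outcome $z$. The function `mix_loss` and the example numbers are illustrative assumptions. By Jensen's inequality the mix loss never exceeds the $\Pi$-average loss, it is never below the smallest loss, and it is nonincreasing in $\eta$.

```python
import math

def mix_loss(losses, weights, eta):
    """Mix loss -(1/eta) * log(sum_f weights[f] * exp(-eta * losses[f]))."""
    base = min(losses)                    # shift by the minimum for numerical stability
    s = sum(w * math.exp(-eta * (l - base)) for w, l in zip(weights, losses))
    return base - math.log(s) / eta

losses = [0.0, 1.0, 4.0]                  # losses of three predictors on one outcome z
weights = [0.2, 0.5, 0.3]                 # a distribution Pi over the three predictors
average = sum(w * l for w, l in zip(weights, losses))

etas = (1e-6, 0.5, 2.0, 50.0)
values = [mix_loss(losses, weights, eta) for eta in etas]
```

For very small $\eta$ the mix loss is close to the average loss, and for large $\eta$ it approaches the loss of the best predictor that receives positive weight.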


Definition 3.2 (Pseudoprobability convexity condition) Let $\eta > 0$ and $\epsilon \ge 0$. We
say that $(\ell, \mathcal{P}, \mathcal{F})$ satisfies the $\eta$-pseudoprobability convexity condition up to $\epsilon$ if there
exists a function $\phi: \mathcal{P} \to \mathcal{F}$ such that

$$\mathbb{E}_{Z \sim P}\left[\ell_{\phi(P)}(Z)\right] \le \mathbb{E}_{Z \sim P}\left[m^{\eta}_{\Pi}(Z)\right] + \epsilon \quad \text{for all } P \in \mathcal{P} \text{ and } \Pi \in \Delta(\mathcal{F}). \tag{15}$$

If it satisfies the $\eta$-pseudoprobability convexity condition up to 0, we say that the strong
$\eta$-pseudoprobability convexity condition or simply the $\eta$-pseudoprobability convexity
condition holds. If it satisfies the $\eta$-pseudoprobability convexity condition up to $\epsilon$ for all $\epsilon > 0$,
we say that the weak $\eta$-pseudoprobability convexity condition holds; this is equivalent to

$$\sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}} \sup_{\Pi \in \Delta(\mathcal{F})} \mathbb{E}_{Z \sim P}\left[\ell_f(Z) - m^{\eta}_{\Pi}(Z)\right] \le 0. \tag{16}$$

Under Assumption A this condition simplifies and implies the essential uniqueness of optimal
predictors (cf. Section 3.3).

Proposition 3.3 (PPC condition implies uniqueness of risk minimizers) Suppose
that Assumption A holds, and that $(\ell, \mathcal{P}, \mathcal{F})$ satisfies the weak $\eta$-pseudoprobability convexity
condition. Then it also satisfies the strong $\eta$-pseudoprobability convexity condition, and for
all $P \in \mathcal{P}$, the $\mathcal{F}$-optimal $f^*$ satisfying (3) is essentially unique, in the sense that, for any
$g^* \in \mathcal{F}$ with $R(P, g^*) = R(P, f^*)$, we have that $\ell_{g^*}(Z) = \ell_{f^*}(Z)$ holds $P$-almost surely.

Proof Assumption A implies that if (15) holds at all, then it also holds with $\phi(P)$ equal
to any $\mathcal{F}$-risk minimizer $f^*$ as in (3). Thus, if it holds for all $\epsilon > 0$, it holds for all $\epsilon > 0$
with the fixed choice $f^*$, and hence it must also hold for $\epsilon = 0$ with the same $f^*$.

As to the second part, consider a distribution $\Pi$ that puts mass 1/2 on $f^*$ and 1/2 on
$g^*$. Then the strong $\eta$-pseudoprobability convexity condition implies that

$$\min_{f \in \mathcal{F}} \mathbb{E}_{Z \sim P}\left[\ell_f(Z)\right] \le \mathbb{E}_{Z \sim P}\left[-\frac{1}{\eta} \log\left(\tfrac{1}{2} e^{-\eta \ell_{f^*}(Z)} + \tfrac{1}{2} e^{-\eta \ell_{g^*}(Z)}\right)\right] \le \mathbb{E}_{Z \sim P}\left[\tfrac{1}{2} \ell_{f^*}(Z) + \tfrac{1}{2} \ell_{g^*}(Z)\right] = \min_{f \in \mathcal{F}} \mathbb{E}_{Z \sim P}\left[\ell_f(Z)\right],$$

where we used convexity of $-\log$ and Jensen’s inequality. Hence both inequalities must
hold with equality. By strict convexity of $-\log$, we know that for the second inequality this
can only be the case if $\ell_{f^*} = \ell_{g^*}$ almost surely, which was to be shown. ■

Finally, we will often make use of the following trivial but important fact. 


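Fact 3.4 Fix $\eta > 0$, $\epsilon \ge 0$ and let $(\ell, \mathcal{P}, \mathcal{F})$ be an arbitrary decision problem that satisfies
the $\eta$-central condition up to $\epsilon$. Then for any $0 < \eta' \le \eta$ and any $\epsilon' \ge \epsilon$ and for any
$\mathcal{P}' \subseteq \mathcal{P}$, $(\ell, \mathcal{P}', \mathcal{F})$ satisfies the $\eta'$-central condition up to $\epsilon'$. The same holds with
‘central’ replaced by ‘PPC’.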


We proceed to give some examples. 

Example 3.5 (Squared Loss, Unrestricted Domain) Consider squared loss
$\ell_f(z) = \frac{1}{2}(z - f)^2$ with $\mathcal{Z} = \mathcal{F} = \mathbb{R}$, and let $\mathcal{P} = \{\mathcal{N}(\mu, 1) : \mu \in \mathbb{R}\}$ be the set of
normal distributions with unit variance and arbitrary means $\mu$. Estimating the mean of a normal
model is a standard inference problem for which a squared error risk of order $O(1/n)$ is obtained by
the sample mean. We would therefore expect the central condition to be satisfied and,
indeed, this is the case for $\eta \le 1$ via a reduction to Example 2.2. To see this, consider
the well-specified setting for the log loss with densities $f' \in \mathcal{F}' = \mathcal{P}$, and note that the
squared loss for $f$ equals the log loss for $f'$ up to a constant when $f$ is the mean of $f'$:

$$\ell_f(z) = -\log e^{-\frac{1}{2}(z - f)^2} = -\log f'(z) - \tfrac{1}{2} \log(2\pi).$$

Since the log loss satisfies the 1-central condition in the well-specified case (see Example 2.2),
the squared loss must also satisfy the 1-central condition. ■


Not surprisingly, the central condition still holds if we replace the Gaussian assumption by 
a subgaussian assumption. 

Example 3.6 For $\sigma^2 > 0$ let $\mathcal{P}_{\sigma^2}$ be an arbitrary subgaussian collection of distributions
over $\mathbb{R}$. That is, for all $t \in \mathbb{R}$ and $P \in \mathcal{P}_{\sigma^2}$,

$$\mathbb{E}_{Z \sim P}\left[e^{t(Z - \mu_P)}\right] \le e^{\sigma^2 t^2 / 2}, \tag{17}$$

where $\mu_P = \mathbb{E}_{Z \sim P}[Z]$ is the mean of $Z$. Now consider the squared loss $\ell_f(z) = \frac{1}{2}(z - f)^2$
again, with $\mathcal{F} = \mathcal{Z} = \mathbb{R}$. Then

$$\ell_f(z) - \ell_{f'}(z) = \tfrac{1}{2} \delta \left(2(z - f) - \delta\right), \quad \text{where } \delta = f' - f. \tag{18}$$

Taking $f = \mu_P$ gives

$$\mathbb{E}_{Z \sim P}\left[e^{\eta(\ell_f(Z) - \ell_{f'}(Z))}\right] = e^{-\eta \delta^2 / 2}\, \mathbb{E}_{Z \sim P}\left[e^{\eta \delta (Z - \mu_P)}\right] \le e^{-\eta \delta^2 / 2}\, e^{\sigma^2 \eta^2 \delta^2 / 2}. \tag{19}$$

The right-hand side is at most 1 if $\eta \le 1/\sigma^2$, and hence to satisfy the strong $\eta$-central
condition with substitution function $\phi(P) = \mu_P$, it suffices to take $\eta \le 1/\sigma^2$. Note that
$\phi$ maps $P$ to the $\mathcal{F}$-optimal predictor for $P$, a fact which holds generally, as shown in
Proposition 3.3 above. Note also that, just like Example 3.5, the example can be reduced
to the log-loss setting in which the densities are all normal densities with means in $\mathbb{R}$ and
variance equal to 1. In Example 5.8 we shall see that if $\mathcal{P}$ contains $P$ with polynomially
large tails, then the $\eta$-central condition may fail. ■
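The threshold $\eta \le 1/\sigma^2$ can be checked numerically in the Gaussian case, where the subgaussian bound (17) holds with equality. The sketch below is an illustration with assumed parameter values: it evaluates the left-hand side of the central condition by the trapezoid rule for $P = \mathcal{N}(\mu, \sigma^2)$, comparator $f = \mu$ and one alternative $f'$, and compares it with the closed form $e^{-\eta\delta^2/2 + \sigma^2\eta^2\delta^2/2}$.

```python
import math

def central_lhs(eta, mu, sigma, f_other, grid=20001, width=12.0):
    """E_{Z ~ N(mu, sigma^2)}[exp(eta * (loss_mu(Z) - loss_other(Z)))] by the trapezoid rule,
    for squared loss loss_f(z) = 0.5 * (z - f)**2 and comparator f = mu."""
    lo, hi = mu - width * sigma, mu + width * sigma
    h = (hi - lo) / (grid - 1)
    total = 0.0
    for i in range(grid):
        z = lo + i * h
        density = math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        excess = 0.5 * (z - mu) ** 2 - 0.5 * (z - f_other) ** 2
        endpoint_weight = 0.5 if i in (0, grid - 1) else 1.0
        total += endpoint_weight * density * math.exp(eta * excess)
    return total * h

def closed_form(eta, sigma, delta):
    """Right-hand side of the Gaussian computation: exp(-eta*d^2/2 + sigma^2*eta^2*d^2/2)."""
    return math.exp(-eta * delta ** 2 / 2 + sigma ** 2 * eta ** 2 * delta ** 2 / 2)

mu, sigma, f_other = 0.0, 2.0, 1.5
threshold = 1.0 / sigma ** 2
at_threshold = central_lhs(threshold, mu, sigma, f_other)
below = central_lhs(0.5 * threshold, mu, sigma, f_other)
above = central_lhs(2.0 * threshold, mu, sigma, f_other)
```

At $\eta = 1/\sigma^2$ the expectation equals 1 for a Gaussian, it is below 1 for smaller $\eta$, and it exceeds 1 for larger $\eta$, so the threshold cannot be improved for this family.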


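Example 3.7 (Subgaussian Regression) Examples 2.2, 3.5 and 3.6 all deal with the
unconditional setting (cf. page 13) of estimating a mean without covariate information. The
corresponding conditional setting is regression, in which $\mathcal{F}$ is a set of functions
$f: \mathcal{X} \to \mathcal{Y}$, $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, $\mathcal{Y} = \mathbb{R}$ and $\ell_f((x, y)) := \frac{1}{2}(y - f(x))^2$. Analogously to
Example 3.6, fix $\sigma^2 > 0$ and let $\mathcal{P}$ be a set of distributions on $\mathcal{X} \times \mathcal{Y}$ such that for each
$P \in \mathcal{P}$ and $x \in \mathcal{X}$, $P(Y \mid X = x)$ is subgaussian in the sense of (17). Now consider a
decision problem $(\ell, \mathcal{P}, \mathcal{F})$. Example 3.6 applies to this regression setting, provided that,
for each $P \in \mathcal{P}$, the model $\mathcal{F}$ contains the true regression function
$f_P(x) := \mathbb{E}_{(X,Y) \sim P}[Y \mid X = x]$. To see this, note that then for all $P \in \mathcal{P}$, all $f \in \mathcal{F}$,

$$\mathbb{E}_{(X,Y) \sim P}\left[e^{\eta(\ell_{f_P}(X,Y) - \ell_f(X,Y))}\right] = \mathbb{E}_{P(X)}\, \mathbb{E}_{P(Y \mid X)}\left[e^{\eta(\ell_{f_P}(X,Y) - \ell_f(X,Y))}\right] \le \mathbb{E}_{P(X)}\left[e^{-\eta \delta_X^2 / 2 + \sigma^2 \eta^2 \delta_X^2 / 2}\right] \le 1,$$

where $\delta_X := f(X) - f_P(X)$ plays the role of $\delta$ in (18), and the final inequality holds as
long as $\eta \le 1/\sigma^2$. Thus the $1/\sigma^2$-central condition holds.
Although it is often made, the assumption that $\mathcal{F}$ contains the Bayes decision rule (i.e.,
the true regression function) is quite strong. In Section 6 we will encounter Example 6.2
where, under a compactness restriction on $\mathcal{P}$, the central condition still holds even though
$\mathcal{F}$ may be misspecified. ■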


Example 3.8 (Bernoulli, 0/1-loss and the margin condition) Let $\mathcal{Z} = \mathcal{F} = \{0, 1\}$, for
any $0 \le \delta \le 1/2$ let $\mathcal{P}_\delta$ be the set of distributions $P$ on $\mathcal{Z}$ with $|P(Z = 1) - 1/2| \ge \delta$, and let
$\ell$ be the 0/1-loss with $\ell_f(z) = |z - f|$. For every $\delta > 0$, there is an $\eta > 0$ such that the
$\eta$-central condition holds for $(\ell, \mathcal{P}_\delta, \mathcal{F})$. To see this, let $f^*$ be the Bayes act for $P$, i.e., $f^* = 1$
if and only if $P(Z = 1) \ge 1/2$, and, for $f \neq f^*$, define
$A(\eta) = \mathbb{E}_{Z \sim P}\left[e^{\eta(\ell_{f^*}(Z) - \ell_f(Z))}\right]$.
Then $A(0) = 1$ and the derivative $A'(0)$ is easily seen to be negative, which implies the
result. However, as $\delta \downarrow 0$, so does the largest $\eta$ for which the central condition holds. For
$\delta = 0$, the central condition does not hold any more. Since the central condition and the
PPC condition are equivalent, this also follows from Proposition 3.3: if $\delta = 0$, then there
exist $P \in \mathcal{P}_\delta$ with $P(Z = 1) = 1/2$, and for this $P$ both $f \in \mathcal{F} = \{0, 1\}$ have equal risk so
there is no unique minimum. For each $\delta > 0$, the restriction to $\mathcal{P}_\delta$ may also be understood
as saying that a Tsybakov margin condition (Tsybakov, 2004) holds with noise exponent
$\infty$, the most stringent case of this condition that has long been known to ensure fast rates.
As will be seen in Example 5.5 the Tsybakov margin condition can also be thought of as a
Bernstein condition with $\beta = 1$ and $B \uparrow \infty$ as $\delta \downarrow 0$ (in practice, however, this condition is
usually applied in the conditional setting with covariates $X$). Finally, just like the squared
loss examples, this example can be recast in terms of log-loss as well. Fix $\beta > 0$ and let $\mathcal{F}_\beta$
be the subset of the Bernoulli model containing two symmetric probability mass functions,
$p_1$ and $p_0$, where $p_1(1) = p_0(0) = e^{\beta}/(1 + e^{\beta}) > 1/2$. Then the log loss Bayes act for $P$ is $p_1$
if and only if $P(Z = 1) \ge 1/2$. For $P \in \mathcal{P}_\delta$ and $f \neq f^*$ in $\mathcal{F}_\beta$,
$\mathbb{E}_{Z \sim P}\left[(f(Z)/f^*(Z))^{\eta}\right] = A(\beta\eta)$,
which by the same argument as above can be made $\le 1$ if $\eta > 0$ is chosen small enough
(provided $\delta > 0$). ■
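For this example the largest admissible $\eta$ can be computed explicitly. The short calculation behind the sketch below is not stated in the text and is offered as an illustration: for $P(Z = 1) = p \ge 1/2$, Bayes act $f^* = 1$ and alternative $f = 0$, one has $A(\eta) = p e^{-\eta} + (1 - p) e^{\eta}$, and $A(\eta) \le 1$ exactly when $\eta \le \log(p/(1-p))$. The worst case over $\mathcal{P}_\delta$ is $p = 1/2 + \delta$. The function names are hypothetical.

```python
import math

def A(eta, p):
    """E[exp(eta * (loss_{f*}(Z) - loss_f(Z)))] for 0/1 loss with Z ~ Bernoulli(p), p >= 1/2,
    Bayes act f* = 1 and alternative f = 0."""
    # The loss difference is -1 when Z = 1 (probability p) and +1 when Z = 0.
    return p * math.exp(-eta) + (1 - p) * math.exp(eta)

def largest_eta(delta):
    """Largest eta with A(eta, p) <= 1 for every p >= 1/2 + delta, for 0 < delta < 1/2."""
    p = 0.5 + delta
    return math.log(p / (1 - p))

deltas = (0.4, 0.1, 0.01)
etas = [largest_eta(d) for d in deltas]
at_threshold = [A(eta, 0.5 + d) for eta, d in zip(etas, deltas)]
```

The values in `etas` shrink towards 0 as $\delta$ decreases, matching the statement that the largest $\eta$ vanishes as $\delta \downarrow 0$.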


3.2 Equivalence of Central and Pseudoprobability Convexity Conditions 

The following result shows that no additional assumptions are required for the central 
condition to imply the pseudoprobability convexity condition. 

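Proposition 3.9 Fix an arbitrary decision problem $(\ell, \mathcal{P}, \mathcal{F})$ and $\epsilon \ge 0$. If the $\eta$-central
condition holds up to $\epsilon$ then the $\eta$-pseudoprobability convexity condition holds up to $\epsilon$. In
particular the (strong) $\eta$-central condition implies the (strong) $\eta$-pseudoprobability convexity
condition.


Proof Let $P \in \mathcal{P}$ and $\Pi \in \Delta(\mathcal{F})$ be arbitrary. Assume the $\eta$-central condition holds up to
$\epsilon$. Then

$$\mathbb{E}_{Z \sim P}\left[\ell_{\phi(P)}(Z) - m^{\eta}_{\Pi}(Z)\right] = \frac{1}{\eta}\, \mathbb{E}_{Z \sim P} \log \mathbb{E}_{f \sim \Pi}\left[e^{\eta(\ell_{\phi(P)}(Z) - \ell_f(Z))}\right] \le \frac{1}{\eta} \log \mathbb{E}_{Z \sim P}\, \mathbb{E}_{f \sim \Pi}\left[e^{\eta(\ell_{\phi(P)}(Z) - \ell_f(Z))}\right] \le \epsilon,$$

where the first inequality is Jensen’s and the second inequality follows from the central
condition (12). ■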

To obtain the reverse implication we require either Assumption A (i.e., that minimum risk
within $\mathcal{F}$ is achieved) or, if Assumption A does not hold, the boundedness of the loss.³
Below we use the term ‘essentially unique’ in the sense of Proposition 3.3 and call any $g^*$
such that $\ell_{g^*}(Z) = \ell_{f^*}(Z)$ occurs $P$-almost-surely a version of $f^*$.


3. We suspect this latter requirement can be weakened, at the cost of considerably complicating the proof. 
Theorem 3.10 Let $(\ell, \mathcal{P}, \mathcal{F})$ be a decision problem. Then the following statements both
hold:

1. If $\ell$ is bounded, then the weak $\eta$-pseudoprobability convexity condition implies the weak
$\eta$-central condition.

2. Moreover, if Assumption A holds, then (irrespective of whether the loss is bounded)
the weak $\eta$-pseudoprobability convexity condition implies the strong $\eta$-central condition
with comparator function $\phi(P) := f^*$ for $\mathcal{F}$-optimal $f^*$. That is, $f^*$ can be any version
of the essentially unique element of $\mathcal{F}$ that satisfies (3).

The proof of Theorem 3.10 is deferred to Appendix A.1. It generalizes a result for log loss
from the PhD thesis of Li (1999, Theorem 4.3) and Barron (2001).⁴ Theorem 3.10 leads to
the following useful consequence.

Corollary 3.11 Consider a decision problem $(\ell, \mathcal{P}, \mathcal{F})$ and suppose that Assumption A
holds. Then the following are equivalent:

1. The weak $\eta$-central condition is satisfied.

2. The strong $\eta$-central condition is satisfied with comparator function $\phi$ as given by
Theorem 3.10.

3. The weak $\eta$-pseudoprobability convexity condition is satisfied.

4. The strong $\eta$-pseudoprobability convexity condition is satisfied.

If any of these statements hold, then for all $P \in \mathcal{P}$, the corresponding optimal $f^*$ is
essentially unique in the sense of Proposition 3.3.

Proof Suppose that the weak $\eta$-pseudoprobability convexity condition holds and that
Assumption A holds. This implies that the infimum in (16) is always achieved, from which
it follows that the strong $\eta$-pseudoprobability convexity condition holds. The assumption
also lets us apply Theorem 3.10 which implies that the strong $\eta$-central condition holds
with $\phi$ as described. This immediately implies the weak $\eta$-central condition which, via
Proposition 3.9, implies the weak $\eta$-pseudoprobability convexity condition. ■

The corollary establishes the equivalence of the weak and strong central and pseudoprobability
convexity conditions which we assumed in Section 2.2. The result prompts the question
whether non-uniqueness of the optimal $f^*$ might imply that the four conditions do not hold.
While this is not true in general, at least for bounded losses it is ‘almost’ true if we replace
the $\eta$-fast rate conditions by the weaker notion of $v$-fast rate conditions of Section 5 (see
Proposition 5.11).

4. Under Assumption A, the proof of Theorem 3.10 shows that it is actually sufficient if the weak
pseudoprobability convexity condition only holds for distributions $\Pi$ on $f^*$ and single $f \in \mathcal{F}$.
Via Proposition 3.9 we then see that this actually implies weak pseudoprobability convexity for all
distributions $\Pi$.

3.3 Interpretation as Convexity of the Set of Pseudoprobabilities and a
Bayesian Interpretation

As we will now explain, both the pseudoprobability convexity condition and, by the equivalence
from the previous section, the central condition may be interpreted as a partial convexity
requirement. For simplicity, we restrict ourselves to the setting of Assumption A from
Section 2.2. We first present this interpretation for the logarithmic loss from Example 2.2
on page 9, for which it is most natural and can also be given a Bayesian interpretation.

Example 3.12 (Example 2.2 continued: convexity interpretation for log loss)
Let $P \in \mathcal{P}$ be arbitrary. Under Assumption A the strong 1-pseudoprobability convexity
condition for log loss says that

$$\mathbb{E}_{Z \sim P}\left[-\log f^*(Z)\right] \le \min_{\Pi \in \Delta(\mathcal{F})} \mathbb{E}_{Z \sim P}\left[-\log \mathbb{E}_{f \sim \Pi}\left[f(Z)\right]\right],$$

i.e.,

$$\min_{f \in \mathcal{F}} \mathbb{E}_{Z \sim P}\left[-\log f(Z)\right] = \min_{f \in \mathrm{co}(\mathcal{F})} \mathbb{E}_{Z \sim P}\left[-\log f(Z)\right], \tag{20}$$

where $f^* = \phi(P)$ and $\mathrm{co}(\mathcal{F})$ denotes the convex hull of $\mathcal{F}$ (i.e., the set of all mixtures of
densities in $\mathcal{F}$). This may be interpreted as the requirement that a convex combination of
elements of the model $\mathcal{F}$ is never better than the best element in the model. This means
that the model is essentially convex with respect to $P$ (i.e., ‘in the direction facing’ $P$;
see Figure 2).

In particular, in the context of Bayesian inference, the Bayesian predictive distribution
after observing data $Z_1, \ldots, Z_n$ is a mixture of elements of the model according to the
posterior distribution, and therefore must be an element of $\mathrm{co}(\mathcal{F})$. The pseudoprobability
convexity condition thus rules out the possibility that the predictive distribution is strictly
better (in terms of expected log loss or, equivalently, KL-divergence) than the best single
element in the model. This might otherwise be possible if the posterior was spread out over
different parts of the model. This interpretation is explained at length by Grünwald and
van Ommen (2014) who provide a simple regression example in which (20) does not hold
and the Bayes predictive distribution is, with substantial probability, better than the best
single element $f^*$ in the model, and the Bayesian posterior does not concentrate around
this optimal $f^*$ at all. ■


For log loss, the convexity requirement (20) is, by Corollary 3.11, equivalent to the strong
1-central condition and can thus be written as

$$\mathbb{E}_{Z \sim P}\left[\frac{f(Z)}{f^*(Z)}\right] \le 1 \tag{21}$$

for all $f \in \mathcal{F}$. Recognizing (6) we therefore also recover the result by Li (1999) mentioned
in Example 2.2.
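A two-point example makes the convexity requirement tangible. The sketch below uses Bernoulli distributions chosen for illustration only. In the well-specified case the ratio in (21) is at most 1. For the non-convex model $\{\mathrm{Bernoulli}(0.1), \mathrm{Bernoulli}(0.9)\}$ under the truth $\mathrm{Bernoulli}(0.5)$, both elements have equal risk, (21) fails, and the equal-weight mixture is strictly better than either element, so (20) fails as well.

```python
import math

def bern(q):
    """Probability mass function of Bernoulli(q) as a dict over outcomes {0, 1}."""
    return {0: 1.0 - q, 1: q}

def expected_ratio(p, f, f_star):
    """E_{Z ~ p}[f(Z) / f_star(Z)], the left-hand side of (21)."""
    return sum(p[z] * f[z] / f_star[z] for z in p)

def risk(p, f):
    """Expected log loss E_{Z ~ p}[-log f(Z)]."""
    return sum(-p[z] * math.log(f[z]) for z in p)

truth = bern(0.5)

# Well-specified case: the model contains the truth, which is then the optimal f*.
well_specified = expected_ratio(truth, bern(0.9), f_star=truth)

# Misspecified, non-convex model with two elements of equal risk.
f_star, f_alt = bern(0.1), bern(0.9)
misspecified = expected_ratio(truth, f_alt, f_star)
mixture = {z: 0.5 * f_star[z] + 0.5 * f_alt[z] for z in truth}
```

Here the mixture coincides with the truth, which is the extreme case of a convex combination beating every single element of the model.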


Example 3.13 (Bayes-MDL Condition) The 1-central condition (21) for log loss plays
a fundamental role in establishing consistency and fast rates for Bayesian and related
methods. Due to its use in a large number of papers on convergence of MDL-based methods

[Figure 2: The pseudoprobability convexity condition interpreted as convexity of the set of
pseudoprobabilities with respect to $P$.]


(Grünwald, 2007) and Bayesian methods and lack of a standard name, we will henceforth
call it the Bayes-MDL condition. Most of the papers using this condition make the traditional
assumption that the model is well-specified, i.e., for every $P \in \mathcal{P}$, $\mathcal{F}$ contains the
density of $P$. As already mentioned in Example 2.2, the condition then holds automatically,
so one does not see (21) stated in those papers as an explicit condition. Yet, if one tries to
generalize the results of such papers to the misspecified case, one invariably sees that the
only step in the proofs needing adjustment is the step where (21) is implicitly employed. If
the model is incorrect yet (21) holds, then the proofs invariably still go through, establishing
convergence towards the $f^*$ that minimizes KL divergence to the true $P$. This happens,
for example, in the MDL convergence proofs of Barron and Cover (1991); Zhang (2006a);
Grünwald (2007) as well as in the pioneering paper by Doob (1949) on Bayesian consistency.
The dependence on (21) becomes more explicit in works explicitly dealing with misspecification
such as those by Li (1999); Kleijn and van der Vaart (2006); Grünwald (2011). For
example, in order to guarantee convergence of the posterior around the best element $f^*$ of
misspecified models, Kleijn and van der Vaart (2006) impose a highly technical condition
on $(\ell, \mathcal{P}, \mathcal{F})$. If, however, (21) holds then this complicated condition simplifies to the
standard, much simpler condition from (Ghosal et al., 2000) which is sufficient for convergence
in the well-specified case. The same phenomenon is seen in results by Ramamoorthi et al.
(2013); De Blasi and Walker (2013). Grünwald and Langford (2004) and Grünwald and van
Ommen (2014) give examples in which the condition does not hold, and Bayes and MDL
estimators fail to converge. ■


The convexity interpretation for log loss may be generalized to other loss functions
via loss dependent ‘pseudoprobabilities’. These play a crucial role both in online learning
(Vovk, 2001) and the PAC-Bayesian analysis of the Bayes posterior and the MDL estimator
by Zhang (2006a). For log loss, we may express the ordinary densities in terms of the loss
as $f(z) = e^{-\ell_f(z)}$. This generalizes to other loss functions by letting $\eta \ell_f(z)$ play the role
of the log loss, where $\eta > 0$ is the scale factor that appears in all our definitions. We thus
obtain the set of pseudoprobabilities

$$\mathcal{P}_{\mathcal{F}}(\eta) = \left\{ z \mapsto e^{-\eta \ell_f(z)} : f \in \mathcal{F} \right\},$$

which are non-negative, but do not necessarily integrate to 1. The only feature we need
of these pseudoprobabilities is that their log loss is equal to $\eta$ times the original loss,
because, analogously to (20), this allows us to write the strong $\eta$-pseudoprobability convexity
condition as

$$\min_{f \in \mathcal{P}_{\mathcal{F}}(\eta)} \mathbb{E}_{Z \sim P}\left[-\log f(Z)\right] \le \min_{f \in \mathrm{co}(\mathcal{P}_{\mathcal{F}}(\eta))} \mathbb{E}_{Z \sim P}\left[-\log f(Z)\right].$$

Figure 2 provides a graphical illustration of this condition. Thus, for any loss function we
can interpret the pseudoprobability convexity condition as the requirement that the set of
pseudoprobabilities is essentially convex with respect to $P$. As suggested by Vovk (2001);
Zhang (2006a), one can also run Bayes on such pseudoprobabilities, and then the
pseudoprobability convexity condition again implies that the resulting pseudo-Bayesian predictive
distribution cannot be strictly better than the single best element of the model. The log
loss achieved with such pseudoprobabilities, and hence $\eta$ times the original loss, can be
given a code length interpretation, essentially allowing arbitrary loss functions to be recast
as versions of logarithmic loss (Grünwald, 2008).

4. Online Learning 

In this section, we discuss conditions for fast rates that are related to online learning. Our
key concept is introduced in Section 4.1, where we define stochastic mixability, the natural
stochastic generalization of Vovk’s notion of mixability, and show (in Section 4.2) how it
unifies existing conditions in the literature. Section 4.3 contains the main results for this
section, which connect stochastic mixability to the central condition and to pseudoprobability
convexity. As an intermediate step, these results use a fourth condition called the
predictor condition, which is related to the central condition via a minimax identity. We
show that, under appropriate assumptions, all four conditions are equivalent. This equivalence
is important because it relates the generic condition for fast rates in online learning
(stochastic mixability) to the generic condition that enables fast rates for proper in-model
estimators in statistical learning (the central condition).

4.1 Stochastic Mixability in General 

Stochastic mixability generalizes from (8) similarly to the way we have generalized the
central condition and pseudoprobability convexity. Let $m^\eta_\Pi$ be the mix loss, as defined
in (14).

Definition 4.1 (The Stochastic Mixability Condition) Let $\eta > 0$ and $\epsilon \ge 0$. We say
that $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ is $\eta$-stochastically mixable up to $\epsilon$ if there exists a substitution function
$\psi : \Delta(\mathcal{F}) \to \mathcal{F}_d$ such that

\[ \mathbb{E}_{Z \sim P}[\ell_{\psi(\Pi)}(Z)] \le \mathbb{E}_{Z \sim P}[m^\eta_\Pi(Z)] + \epsilon \quad \text{for all } P \in \mathcal{P} \text{ and } \Pi \in \Delta(\mathcal{F}). \tag{22} \]

If it is $\eta$-stochastically mixable up to 0, we say that it is strongly $\eta$-stochastically mixable
or simply $\eta$-stochastically mixable. If it is $\eta$-stochastically mixable up to $\epsilon$ for all $\epsilon > 0$, we
say that it is weakly $\eta$-stochastically mixable; this is equivalent to

\[ \sup_{\Pi \in \Delta(\mathcal{F})} \inf_{f \in \mathcal{F}_d} \sup_{P \in \mathcal{P}} \mathbb{E}_{Z \sim P}[\ell_f(Z) - m^\eta_\Pi(Z)] \le 0. \tag{23} \]

Unlike for the central and pseudoprobability convexity conditions (see Corollary 3.11), for
stochastic mixability it is not clear whether the weak and strong versions become equivalent
under the simplifying Assumption A. We do have a trivial yet important extension of
Fact 3.4:

Fact 4.2 Fix $\eta > 0$, $\epsilon \ge 0$ and let $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ be an arbitrary decision problem that is
$\eta$-stochastically mixable up to $\epsilon$. Then for any $0 < \eta' \le \eta$, any $\epsilon' \ge \epsilon$ and for any $\mathcal{P}' \subseteq \mathcal{P}$,
$\mathcal{F}' \subseteq \mathcal{F}$ and $\mathcal{F}_d' \supseteq \mathcal{F}_d$, $(\ell, \mathcal{P}', \mathcal{F}', \mathcal{F}_d')$ is $\eta'$-stochastically mixable up to $\epsilon'$.

4.2 Relations to Conditions in the Literature 

As explained next, stochastic mixability generalizes Vovk's notion of (non-stochastic)
mixability, and correspondingly implies fast rates. Its most important special case is stochastic
exp-concavity, for which Juditsky et al. (2008) give sufficient conditions, and which is used
by, e.g., Dalalyan and Tsybakov (2012). Stochastic mixability is also equivalent to a special
case of a condition introduced by Audibert (2009).

4.2.1 Generalization of Vovk’s Mixability and Fast Rates for Stochastic 
Prediction with Expert Advice 

If we take $\epsilon = 0$ and let $\mathcal{P}$ be the set of all possible distributions, then (22) reduces to

\[ \ell_{\psi(\Pi)}(z) \le m^\eta_\Pi(z) \quad \text{for all } z \in \mathcal{Z} \text{ and } \Pi \in \Delta(\mathcal{F}), \tag{24} \]

which is Vovk's original definition of (non-stochastic) mixability (Vovk, 2001). It follows
that Vovk's mixability implies strong stochastic mixability for all sets $\mathcal{P}$.

Example 4.3 (Mixable Losses) Losses that are classically mixable in Vovk's sense include
the squared loss $\ell^{\mathrm{sq}}_f(z) = \frac{1}{2}(z - f)^2$ on a bounded domain $\mathcal{Z} = \mathcal{F}_d = \mathcal{F} = [-B, B]$,
which is $1/B^2$-mixable (Vovk, 2001, Lemma 3)^5, and the logarithmic loss, which is 1-mixable
for $\mathcal{F}_d \supseteq \mathrm{co}(\mathcal{F})$ with substitution function equal to the mean $\psi(\Pi) = \mathbb{E}_{f \sim \Pi}[f]$. The Brier
score is also 1-mixable (Vovk and Zhdanov, 2009; van Erven et al., 2012b); this loss function
is defined for all possible probability distributions $\mathcal{F}_d = \mathcal{F}$ on a finite set of outcomes $\mathcal{Z}$
according to $\ell^{\mathrm{Brier}}_f(z) = \sum_{z' \in \mathcal{Z}} (f(z') - \delta_z(z'))^2$, where $\delta_z$ denotes a point-mass at $z$. ■
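The log-loss case of Example 4.3 can be checked numerically. The sketch below is only an illustration on a hypothetical toy model (three mass functions on three outcomes, with made-up weights); it assumes the mix loss of (14) has the usual form $m^\eta_\Pi(z) = -\frac{1}{\eta}\log \mathbb{E}_{f\sim\Pi}[e^{-\eta \ell_f(z)}]$. For $\eta = 1$ the log loss of the mean prediction coincides with the mix loss, so (24) holds with equality.

```python
import math

def mix_loss(losses, weights, eta):
    """Mix loss: -(1/eta) * log E_{f~Pi}[exp(-eta * loss_f(z))] at one fixed z."""
    return -math.log(sum(w * math.exp(-eta * l) for w, l in zip(weights, losses))) / eta

# Hypothetical toy model: three mass functions on the outcomes {0, 1, 2}.
experts = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
]
prior = [0.5, 0.3, 0.2]  # the distribution Pi over the three mass functions

# Substitution function psi(Pi): the Pi-weighted mean of the mass functions.
mean_expert = [sum(w * f[z] for w, f in zip(prior, experts)) for z in range(3)]

# Gap between the log loss of psi(Pi) and the mix loss, for every outcome z.
gaps = []
for z in range(3):
    log_losses = [-math.log(f[z]) for f in experts]
    gaps.append(-math.log(mean_expert[z]) - mix_loss(log_losses, prior, eta=1.0))
```

Every entry of `gaps` is zero up to floating-point error, which is the statement that log loss is 1-mixable with the mean as substitution function.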


5. Taking into account the factor of 1/2 difference between his definition of squared loss as $(z - f)^2$ and ours.

Example 4.4 (0/1 Loss: Example 3.8, Continued) Fix $0 \le \delta \le 1/2$ and consider a
decision problem $(\ell^{01}, \mathcal{P}_\delta, \mathcal{F})$ where $\ell^{01}$ is the 0/1-loss, $\mathcal{Z} = \mathcal{F} = \{0, 1\}$ and $\mathcal{P}_\delta$ is as in
Example 3.8. The 0/1-loss is not $\eta$-mixable for any $\eta > 0$ (Vovk, 1998), and it is also easily
shown that $(\ell^{01}, \mathcal{P}_\delta, \mathcal{F}, \mathcal{F})$ is not $\eta$-stochastically mixable for any $\eta > 0$; nevertheless, if
$\delta > 0$, then $(\ell^{01}, \mathcal{P}_\delta, \mathcal{F})$ does satisfy the $\eta$-central condition for some $\eta > 0$. In Section 4.3
we show that, under some conditions, the $\eta$-central condition and $\eta$-stochastic mixability
coincide, but this example shows that this cannot always be the case. ■


Vovk defines the aggregating algorithm (AA) and shows that it achieves constant regret
in the setting of prediction with expert advice, which is the online learning equivalent
of fast rates, provided that (24) is satisfied. In prediction with expert advice, the data
$Z_1, \ldots, Z_n$ are chosen by an adversary, but one may define a stochastic analogue by letting
the adversary instead choose $P_1, \ldots, P_n \in \mathcal{P}$, where the choice of $P_i$ may depend on the
player's predictions on rounds $1, \ldots, i - 1$, and letting $Z_i \sim P_i$ for all $i = 1, \ldots, n$. It
turns out that under no further conditions, stochastic mixability implies fast rates for the
expected regret under $P_1, \ldots, P_n$ in this stochastic version of prediction with expert advice.
In particular, there is no requirement that losses are bounded.

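Proposition 4.5 Let $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ be $\eta$-stochastically mixable up to $\epsilon$ with substitution
function $\psi$. Assume the data $Z_1, \ldots, Z_n$ are distributed as $Z_j \sim P_j$ for each $j \in [n]$, where
the $P_j$ can be adversarially chosen. Then the AA, playing $f_j \in \mathcal{F}_d$ in round $j$, achieves, for
all $f \in \mathcal{F}$, regret

\[ \mathbb{E}\Big[\sum_{j=1}^{n} \big(\ell_{f_j}(Z_j) - \ell_f(Z_j)\big)\Big] \le \frac{\log |\mathcal{F}|}{\eta} + n\epsilon. \]

In particular, in the statistical learning (stochastic i.i.d.) setting where $P_1, \ldots, P_n$ all equal
the same $P$, online-to-batch conversion yields the bound $\frac{\log |\mathcal{F}|}{\eta n} + \epsilon$ on the expected regret,
and hence the rate (2) of the AA is $O(1/n + \epsilon)$.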

Proof For $\epsilon = 0$, the first result follows by replacing every occurrence of mixability with
stochastic mixability in Vovk's proof (see Section 4 of Vovk (1998) or the proof of
Proposition 3.2 of Cesa-Bianchi and Lugosi (2006)). The case of $\epsilon > 0$ is handled simply by
adding a slack of $\epsilon$ to the RHS of the first equation after equation (18) of Vovk (1998). The
online-to-batch conversion of the second result is well-known and can be found e.g. in the
proof of Lemma 4.3 of Audibert (2009). ■
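To make the constant-regret claim concrete, here is a minimal sketch of the exponentially weighted average forecaster, that is, the aggregating algorithm with the mean as substitution function, not the general AA. It assumes the loss $\frac{1}{2}(z - f)^2$ on $[-B, B]$ with $\eta = 1/(4B^2)$, the exp-concavity constant discussed in Section 4.2.2, and a uniform prior over finitely many experts. The experts, data distribution and all names in the code are hypothetical.

```python
import math
import random

def run_ewa(expert_preds, outcomes, eta):
    """Exponentially weighted average forecaster for the loss 0.5 * (z - f)**2.

    expert_preds[t][k] is expert k's prediction in round t. Returns the
    learner's cumulative loss and the list of cumulative expert losses.
    """
    n_experts = len(expert_preds[0])
    cum = [0.0] * n_experts              # cumulative loss of each expert
    learner = 0.0
    for preds, z in zip(expert_preds, outcomes):
        base = min(cum)                  # shift exponents for numerical stability
        w = [math.exp(-eta * (c - base)) for c in cum]
        total = sum(w)
        # Substitution function: the posterior-weighted mean prediction.
        f = sum(wk * p for wk, p in zip(w, preds)) / total
        learner += 0.5 * (z - f) ** 2
        for k, p in enumerate(preds):
            cum[k] += 0.5 * (z - p) ** 2
    return learner, cum

random.seed(0)
B, n = 1.0, 500
eta = 1.0 / (4 * B ** 2)
centers = [-0.8, -0.4, 0.0, 0.4, 0.8]    # hypothetical constant experts in [-B, B]
expert_preds = [centers for _ in range(n)]
# Hypothetical data: Gaussian draws clipped to [-B, B].
outcomes = [max(-B, min(B, random.gauss(0.4, 0.5))) for _ in range(n)]

learner_loss, expert_losses = run_ewa(expert_preds, outcomes, eta)
regret = learner_loss - min(expert_losses)
bound = math.log(len(centers)) / eta     # log|F| / eta, independent of n
```

The regret stays below `bound` for every data sequence in $[-B, B]$, not only in expectation, because exp-concavity holds pointwise here; the proposition covers the weaker situation in which the inequality holds only in expectation under each $P_j$.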


4.2.2 Special Case: Stochastic Exp-concavity 

In online convex optimization, an important sufficient condition for fast rates requires the
loss to be $\eta$-exp-concave in $f$ (Hazan et al., 2007), meaning that $\mathcal{F} = \mathcal{F}_d$ is convex and that

\[ e^{-\eta \ell_f(z)} \text{ is concave in } f \text{ for all } z \in \mathcal{Z}. \tag{25} \]

We may equivalently express this requirement as

\[ e^{-\eta \ell_{\mathbb{E}_{f \sim \Pi}[f]}(z)} \ge \mathbb{E}_{f \sim \Pi}\big[e^{-\eta \ell_f(z)}\big] \quad \text{or} \quad \ell_{\mathbb{E}_{f \sim \Pi}[f]}(z) \le m^\eta_\Pi(z) \]

for all distributions $\Pi \in \Delta(\mathcal{F})$ and all $z \in \mathcal{Z}$. This shows that exp-concavity is a special
case of mixability, where we require the function $\psi$ to map $\Pi$ to its mean:

\[ \psi(\Pi) = \mathbb{E}_{f \sim \Pi}[f]. \]

Because the mean $\mathbb{E}_{f \sim \Pi}[f]$ depends not only on the losses but also on the choice of
parameters $f$, we therefore see that exp-concavity is parametrization-dependent, whereas in
general the property of being mixable is unaffected by the choice of parametrization. The
parametrization-dependent nature of exp-concavity is explored in detail by Vernet et al.
(2011); Kamalaruban et al. (2015); see also van Erven et al. (2012b); van Erven (2012).

Example 4.6 (Exp-concavity) Consider again the mixable losses from Example 4.3.
Then the log loss is 1-exp-concave. The squared loss, in its standard parametrization,
is not $1/B^2$-exp-concave, but it is $1/(4B^2)$-exp-concave, losing a factor of 4 (Vovk, 2001,
Remark 3). By continuously reparametrising the squared loss, however, it can be made
$1/B^2$-exp-concave after all (Kamalaruban et al., 2015; van Erven, 2012). It is not known
whether there exists a parametrization that makes the Brier score 1-exp-concave. ■
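The factor of 4 in Example 4.6 can be verified directly. For the loss $\frac{1}{2}(z - f)^2$, the function $g(f) = e^{-\eta (z - f)^2 / 2}$ has second derivative $g''(f) = \eta\, g(f)\,(\eta (z - f)^2 - 1)$, so $g$ is concave on $[-B, B]$ for every $z \in [-B, B]$ exactly when $\eta (2B)^2 \le 1$. The sketch below checks this sign condition on a grid; the grid size and function names are arbitrary choices made for illustration.

```python
import math

def second_derivative(eta, z, f):
    """Second derivative in f of g(f) = exp(-eta * 0.5 * (z - f)**2)."""
    g = math.exp(-0.5 * eta * (z - f) ** 2)
    return eta * g * (eta * (z - f) ** 2 - 1.0)

def is_exp_concave(eta, B, grid=101):
    """Check g'' <= 0 for all z and f on an evenly spaced grid over [-B, B]."""
    pts = [-B + 2 * B * i / (grid - 1) for i in range(grid)]
    return all(second_derivative(eta, z, f) <= 1e-12 for z in pts for f in pts)

B = 2.0
ok_small = is_exp_concave(1.0 / (4 * B ** 2), B)   # eta = 1/(4 B^2)
ok_large = is_exp_concave(1.0 / B ** 2, B)         # eta = 1/B^2
```

With $\eta = 1/(4B^2)$ the check passes, while $\eta = 1/B^2$ fails at pairs such as $z = B$, $f = -B$, consistent with the example.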


The natural generalization of exp-concavity to stochastic exp-concavity becomes:

Definition 4.7 Suppose $\mathcal{F}_d \supseteq \mathrm{co}(\mathcal{F})$. Then we say that $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ is $\eta$-stochastically
exp-concave up to $\epsilon$ or strongly/weakly $\eta$-stochastically exp-concave if it satisfies the
corresponding case of stochastic mixability with substitution function $\psi(\Pi) = \mathbb{E}_{f \sim \Pi}[f]$.


4.2.3 The JRT Conditions Imply Stochastic Exp-concavity 

Juditsky, Rigollet, and Tsybakov (2008) introduced two conditions that guarantee fast rates 
in model selection aggregation. For now we focus on the following condition, mentioned in 
their Theorem 4.2, which we henceforth refer to as the JRT-II condition, returning to the 
JRT-I condition, mentioned in their Theorem 4.1, in Section 5.3. 


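Definition 4.8 (JRT-II condition) Let $\eta > 0$. We say that $(\ell, \mathcal{P}, \mathcal{F})$ satisfies the
$\eta$-JRT-II condition if there exists a function $\gamma : \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ satisfying (a) for all $f \in \mathcal{F}$,
$\gamma(f, f) = 1$; (b) for all $f \in \mathcal{F}$, the function $g \mapsto \gamma(f, g)$ is concave; and (c)
for all $P \in \mathcal{P}$ and $f, g \in \mathcal{F}$:

\[ \mathbb{E}_{Z \sim P}\big[e^{\eta(\ell_f(Z) - \ell_g(Z))}\big] \le \gamma(f, g). \tag{26} \]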


This condition has been used to obtain fast $O(1/n)$ rates for the mirror averaging estimator
in model selection aggregation, which is statistical learning against a finite class of functions
$\mathcal{F} = \{f_1, \ldots, f_m\}$ (Juditsky et al., 2008). One may interpret their approach as using
Vovk's aggregating algorithm to get $O(1)$ expected regret, and then applying online-to-batch
conversion (Cesa-Bianchi et al., 2004; Barron, 1987; Yang and Barron, 1999), which
leads to an estimator whose risk is upper bounded by the expected regret divided by $n$.
This use of the AA is allowed, because, if $\mathcal{F}_d \supseteq \mathrm{co}(\mathcal{F})$, then the JRT-II condition implies
strong stochastic exp-concavity, as already shown by Audibert (2009) as part of the proof
of his Corollary 5.1:

Proposition 4.9 If $(\ell, \mathcal{P}, \mathcal{F})$ satisfies the $\eta$-JRT-II condition, then $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ satisfies
the strong $\eta$-stochastic exp-concavity condition for any $\mathcal{F}_d \supseteq \mathrm{co}(\mathcal{F})$.

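Proof From the JRT-II condition, for all $P \in \mathcal{P}$ and $\Pi \in \Delta(\mathcal{F})$,

\[ \mathbb{E}_{Z \sim P}\, \mathbb{E}_{g \sim \Pi}\big[e^{\eta(\ell_{\psi(\Pi)}(Z) - \ell_g(Z))}\big] \le \mathbb{E}_{g \sim \Pi}\, \gamma(\psi(\Pi), g), \]

which from the concavity of $\gamma$ in its second argument is at most

\[ \gamma\Big(\psi(\Pi), \mathbb{E}_{g \sim \Pi}[g]\Big) = \gamma(\psi(\Pi), \psi(\Pi)) = 1, \]

by the definition of $\psi$ and part (a) of the JRT-II condition. Thus, we have

\[ \mathbb{E}_{Z \sim P}\, \mathbb{E}_{g \sim \Pi}\big[e^{\eta(\ell_{\psi(\Pi)}(Z) - \ell_g(Z))}\big] \le 1. \]

Applying Jensen's inequality to the exponential function completes the proof. ■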

Juditsky et al. (2008) use the JRT-II condition in the proof of their Theorem 4.2 as a sufficient
condition for another condition, which is then shown to imply $O(1/n)$ rates for finite
classes $\mathcal{F}$. After some basic rewriting, this other condition (which requires the formula
below Eq. (4.1) in their paper to be $\le 0$) is seen to be equivalent to strong stochastic
exp-concavity as defined in Definition 4.7, i.e. it requires that (22) holds with $\epsilon = 0$ and
substitution function $\psi(\Pi) = \mathbb{E}_{f \sim \Pi}[f]$. The JRT-I condition, which we define in Section 5.3,
can be related to stochastic exp-concavity with nonzero $\epsilon$; thus we may say that the underlying
condition that JRT work with is equivalent to our stochastic exp-concavity condition,
albeit that they restrict themselves to a finite class of functions.


4.2.4 Relation to Audibert’s Condition 


Audibert (2009, p. 1596) presented a condition which he called the variance inequality. It
is defined relative to a tuple $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ and has the following requirement as a special
case (in Audibert's notation, this corresponds to $\delta_\lambda = 0$ and $\hat{\pi}$ a Dirac distribution on some
$f \in \mathcal{F}_d$):


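\[ \forall \Pi \in \Delta(\mathcal{F}) \;\; \exists f \in \mathcal{F}_d : \quad \sup_{P \in \mathcal{P}} \mathbb{E}_{Z \sim P} \log \mathbb{E}_{g \sim \Pi}\big[e^{\eta(\ell_f(Z) - \ell_g(Z))}\big] \le 0. \]

Rewriting

\[ \mathbb{E}_{Z \sim P} \log \mathbb{E}_{g \sim \Pi}\big[e^{\eta(\ell_f(Z) - \ell_g(Z))}\big] = \eta\, \mathbb{E}_{Z \sim P}\big[\ell_f(Z) - m^\eta_\Pi(Z)\big], \]

this is seen to be precisely equivalent to strong stochastic mixability.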
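The rewriting step is a one-line computation. As a sketch, assuming the mix loss of (14) has the usual form $m^\eta_\Pi(z) = -\frac{1}{\eta} \log \mathbb{E}_{g \sim \Pi}[e^{-\eta \ell_g(z)}]$, for every fixed $z$ and $f$:

```latex
\begin{align*}
\log \mathbb{E}_{g \sim \Pi}\left[e^{\eta(\ell_f(z) - \ell_g(z))}\right]
  &= \eta\,\ell_f(z) + \log \mathbb{E}_{g \sim \Pi}\left[e^{-\eta \ell_g(z)}\right] \\
  &= \eta\,\ell_f(z) - \eta\, m^\eta_\Pi(z)
   = \eta\left(\ell_f(z) - m^\eta_\Pi(z)\right).
\end{align*}
```

Taking the expectation over $Z \sim P$ and then the supremum over $P \in \mathcal{P}$ turns the special case of the variance inequality into (22) with $\epsilon = 0$ and $\psi(\Pi) = f$.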

4.3 Relations with Central and Pseudoprobability Convexity Conditions

We now turn to the relations between stochastic mixability and the two main conditions 
from Section 3: the central condition and pseudoprobability convexity. We first define the 
predictor condition, which will act as an intermediate step, and then show the following 
implications: 

predictor ⇒ stochastic mixability ⇒ PPC ⇒ CC ⇒ predictor (under assumptions).

The implication from pseudoprobability convexity to the central condition was shown in
Theorem 3.10 from Section 3.2; we will consider the other ones in turn in this section.
The second implication is of special interest since, in the online setting, there is extra power
because predictions may take place in a set $\mathcal{F}_d$ that can be larger than $\mathcal{F}$. The conditions of
the second implication will identify situations in which this additional power is not helpful.


4.3.1 The Predictor Condition in General 
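We define the general predictor condition as follows:

Definition 4.10 (Predictor Condition) Let $\eta > 0$ and $\epsilon \ge 0$. We say that $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$
satisfies the $\eta$-predictor condition up to $\epsilon$ if there exists a prediction function $\psi : \Delta(\mathcal{F}) \to \mathcal{F}_d$
such that

\[ \mathbb{E}_{Z \sim P}\, \mathbb{E}_{f \sim \Pi}\big[e^{\eta(\ell_{\psi(\Pi)}(Z) - \ell_f(Z))}\big] \le e^{\eta \epsilon} \quad \text{for all } P \in \mathcal{P} \text{ and distributions } \Pi \text{ on } \mathcal{F}. \tag{27} \]

If it satisfies the $\eta$-predictor condition up to 0, we say that the strong $\eta$-predictor condition
or simply the $\eta$-predictor condition holds. If it satisfies the $\eta$-predictor condition up to $\epsilon$
for all $\epsilon > 0$, we say that the weak $\eta$-predictor condition holds; this is equivalent to

\[ \sup_{\Pi \in \Delta(\mathcal{F})} \inf_{f \in \mathcal{F}_d} \sup_{P \in \mathcal{P}} \mathbb{E}_{Z \sim P}\, \mathbb{E}_{g \sim \Pi}\big[e^{\eta(\ell_f(Z) - \ell_g(Z))}\big] \le 1. \tag{28} \]

Comparing (28) to the central condition, we see that the predictor condition looks similar,
except that the suprema over $\Pi$ and $P$ are interchanged. We note that, trivially, Fact 4.2
extends from $\eta$-stochastic mixability to the $\eta$-predictor condition.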


4.3.2 Predictor Implies Stochastic Mixability 

By an application of Jensen’s inequality, the predictor condition always implies stochastic 
mixability, without any assumptions: 


Proposition 4.11 Suppose that $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ satisfies the $\eta$-predictor condition up to some
$\epsilon \ge 0$. Then it is $\eta$-stochastically mixable up to $\epsilon$. In particular, the (strong) $\eta$-predictor
condition implies (strong) $\eta$-stochastic mixability.


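Proof Let $P \in \mathcal{P}$, $\Pi \in \Delta(\mathcal{F})$ and $\epsilon \ge 0$ be arbitrary. Then, by Jensen's inequality, the
$\eta$-predictor condition up to $\epsilon$ implies

\[ e^{\eta \epsilon} \ge \mathbb{E}_{Z \sim P}\, \mathbb{E}_{f \sim \Pi}\big[e^{\eta(\ell_{\psi(\Pi)}(Z) - \ell_f(Z))}\big] = \mathbb{E}_{Z \sim P}\big[e^{\eta(\ell_{\psi(\Pi)}(Z) - m^\eta_\Pi(Z))}\big] \ge e^{\eta\, \mathbb{E}_{Z \sim P}[\ell_{\psi(\Pi)}(Z) - m^\eta_\Pi(Z)]}. \]

Taking logarithms on both sides leads to $\mathbb{E}_{Z \sim P}[\ell_{\psi(\Pi)}(Z)] \le \mathbb{E}_{Z \sim P}[m^\eta_\Pi(Z)] + \epsilon$, which is
$\eta$-stochastic mixability up to $\epsilon$. ■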
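The Jensen step can be illustrated on a finite toy problem. In the sketch below all numbers (the distribution `P`, the weights `Pi`, the loss table and the losses of the prediction $\psi(\Pi)$) are hypothetical. It computes the quantity on the left of (27), reads off the smallest $\epsilon$ for which (27) holds at this particular $P$ and $\Pi$, and compares it with the stochastic mixability gap $\mathbb{E}_{Z \sim P}[\ell_{\psi(\Pi)}(Z) - m^\eta_\Pi(Z)]$.

```python
import math

def predictor_quantity(P, Pi, loss, loss_psi, eta):
    """E_{Z~P} E_{f~Pi} exp(eta * (loss_psi(Z) - loss_f(Z))) on a finite outcome space."""
    return sum(
        pz * sum(w * math.exp(eta * (loss_psi[z] - loss[k][z])) for k, w in enumerate(Pi))
        for z, pz in enumerate(P)
    )

def mixability_gap(P, Pi, loss, loss_psi, eta):
    """E_{Z~P}[loss_psi(Z) - mix loss of Pi at Z]."""
    gap = 0.0
    for z, pz in enumerate(P):
        mix = -math.log(sum(w * math.exp(-eta * loss[k][z]) for k, w in enumerate(Pi))) / eta
        gap += pz * (loss_psi[z] - mix)
    return gap

# Hypothetical numbers: two predictors, three outcomes.
eta = 0.5
P = [0.2, 0.5, 0.3]                          # distribution of Z
Pi = [0.6, 0.4]                              # distribution over the two predictors
loss = [[0.1, 0.9, 0.4], [0.8, 0.2, 0.5]]    # loss[k][z]
loss_psi = [0.3, 0.5, 0.45]                  # losses of some prediction psi(Pi)

S = predictor_quantity(P, Pi, loss, loss_psi, eta)
epsilon = math.log(S) / eta                  # smallest epsilon with S <= exp(eta * epsilon)
gap = mixability_gap(P, Pi, loss, loss_psi, eta)
```

Whatever numbers are plugged in, `gap` never exceeds `epsilon`, which is exactly the content of Proposition 4.11 for a single pair $(P, \Pi)$.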


4.3.3 Stochastic Mixability Implies Pseudoprobability Convexity 

In Proposition 4.12 below, we show that, under the right assumptions, stochastic mixability 
implies pseudoprobability convexity. 

A complication in establishing this implication is that stochastic mixability is defined
relative to a four-tuple $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$, and allows us to play in a decision set $\mathcal{F}_d$ that is different
from $\mathcal{F}$, whereas the pseudoprobability convexity is defined relative to the triple $(\ell, \mathcal{P}, \mathcal{F})$.
The proposition automatically holds if one takes $\mathcal{F} = \mathcal{F}_d$, and then the implication follows
trivially. In practice, however, we may have a non-convex model $\mathcal{F}$ (as is quite usual
in e.g. density estimation), whereas the decision set $\mathcal{F}_d$ for which we can establish that
$(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ is $\eta$-stochastically mixable is equal to the convex hull of $\mathcal{F}$. It would be quite
disappointing if, in such cases, there would be no hope of getting fast rates for in-model
statistical learning algorithms. The second part of the proposition shows that, luckily, fast
rates are still possible under the following assumption:

Assumption B (model $\mathcal{F}$ and decision set $\mathcal{F}_d$ equally good: $\mathcal{F}$ well-specified
relative to $\mathcal{F}_d$) We say that Assumption B holds weakly for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ if, for all $P \in \mathcal{P}$,

\[ \inf_{f \in \mathcal{F}} R(P, f) = \inf_{f \in \mathcal{F}_d} R(P, f). \tag{29} \]

We say that Assumption B holds strongly if additionally, for all $P \in \mathcal{P}$, both infima are
achieved: $\min_{f \in \mathcal{F}} R(P, f) = \min_{f \in \mathcal{F}_d} R(P, f)$.

The strong version of Assumption B implies Assumption A and will be used further on in
Theorem 4.14. In a typical application of the proposition below, the weak Assumption B
would be assumed relative to a $\mathcal{F}_d$ such that $\mathcal{F} \subset \mathcal{F}_d$.

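Proposition 4.12 Suppose that Assumption B holds weakly for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$. If $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$
is $\eta$-stochastically mixable up to some $\epsilon \ge 0$, then $(\ell, \mathcal{P}, \mathcal{F})$ satisfies the $\eta$-pseudoprobability
convexity condition up to $\delta$ for any $\delta > \epsilon$; in particular, weak $\eta$-stochastic mixability of
$(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ implies the weak $\eta$-PPC condition for $(\ell, \mathcal{P}, \mathcal{F})$. Moreover, if Assumption A
also holds and $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ satisfies strong $\eta$-stochastic mixability, then $(\ell, \mathcal{P}, \mathcal{F})$ satisfies
the strong $\eta$-PPC condition.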

If Assumption A and the weak version of Assumption B both hold, then, using this proposition,
if we have $\eta$-stochastic mixability for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ we can directly conclude from
Theorem 3.10 that we also have the $\eta$-central condition for $(\ell, \mathcal{P}, \mathcal{F})$. So when does
Assumption B hold? Let us assume that $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ satisfies $\eta$-stochastic mixability. In all
cases we are aware of, it then also satisfies $\eta$-stochastic mixability for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d')$, where
$\mathcal{F}_d'$ is equal to, or an arbitrary superset of, $\mathrm{co}(\mathcal{F})$; in the special case of $\eta$-stochastic
exp-concavity this actually follows by definition. An extreme case occurs if we take $\mathcal{F}_d'$ to
be the set of all functions that can be defined on a domain (Example 2.1). Then Assumption B
expresses that the model $\mathcal{F}$ is well-specified. But the assumption is weaker: assuming
again that $\mathcal{F}_d'$ can be taken to be the convex hull of $\mathcal{F}$, it also holds if $\mathcal{F}$ is itself convex and
contains, for all $P \in \mathcal{P}$, a risk minimizer; and also if, more weakly still, $\mathcal{F}$ is convex 'in the
direction facing $P$'. Note that, for the log-loss, we already knew that the 1-central condition
holds under this condition, from the Bayesian interpretation in Section 3.3. There we also
established a generalization to other loss functions: the $\eta$-central condition holds if the set
of pseudoprobabilities $\mathcal{P}_{\mathcal{F}}(\eta)$ is convex 'in the direction facing $P$' (Figure 2). But, for all loss
functions except log-loss, that was a condition involving pseudoprobabilities and artificial
(mix) losses. The novelty of Proposition 4.12 is that, if $\eta$-stochastic mixability holds for
$(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ with $\mathcal{F}_d = \mathrm{co}(\mathcal{F})$ (as e.g. when we have $\eta$-stochastic exp-concavity), then the
result generalizes further to 'the $\eta$-central condition holds if the set $\mathcal{F}$ itself (rather than
the artificial set $\mathcal{P}_{\mathcal{F}}(\eta)$) is convex in the direction facing $P$'.

Example 4.13 (Fast Rates in Expectation rather than Probability) Fast rate results
proved under the $\eta$-central condition, such as our result in Section 7 and the various
results by Zhang (2006b), generally hold both in expectation and in probability. The situation
is different for $\eta$-stochastic mixability: extending the analysis of Vovk's Aggregating
Algorithm to tuples $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ and using the online-to-batch conversion, we can only
prove a fast rate result in expectation, and not in probability. Audibert (2007) provides
a by now well-known example $(\ell^{\mathrm{sq}}, \mathcal{P}, \mathcal{F}, \mathrm{co}(\mathcal{F}))$ with squared loss in which the rate obtained
by the exponentially weighted forecaster (the aggregating algorithm applied with
$\psi(\Pi) = \mathbb{E}_{f \sim \Pi}[f]$) followed by online-to-batch conversion is $O(1/n)$ in expectation, yet only
$\asymp 1/\sqrt{n}$ in probability; and ERM also gives a rate, both in probability and in expectation, of
$1/\sqrt{n}$ (Theorem 2 of Audibert (2007)). As might then be expected, in Audibert's decision
problem $\eta$-exp-concavity holds for some $\eta > 0$ yet the central condition does not hold for
any $\eta > 0$. Proposition 4.12 then implies that Assumption B must be violated: the best
$f \in \mathrm{co}(\mathcal{F})$ is better than the best $f \in \mathcal{F}$. Inspection of the example shows that this is indeed
the case (a related point was made earlier by Lecué (2011)). ■


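Proof (of Proposition 4.12) Note that (22), the definition of $\eta$-stochastic mixability up
to $\epsilon$, can be rewritten as

\[ \forall \Pi \in \Delta(\mathcal{F}) \;\; \exists f \in \mathcal{F}_d \;\; \forall P \in \mathcal{P} : \quad \mathbb{E}_{Z \sim P}[\ell_f(Z)] \le \mathbb{E}_{Z \sim P}[m^\eta_\Pi(Z)] + \epsilon. \]

This trivially implies

\[ \forall \Pi \in \Delta(\mathcal{F}) \;\; \forall P \in \mathcal{P} \;\; \exists f \in \mathcal{F}_d : \quad \mathbb{E}_{Z \sim P}[\ell_f(Z)] \le \mathbb{E}_{Z \sim P}[m^\eta_\Pi(Z)] + \delta, \tag{30} \]

for any $\delta > \epsilon$. This implies that for any $\delta > \epsilon$, we can assume that the choice of $f$ in (30)
only depends on $P$ and not on $\Pi$. We would therefore obtain $\eta$-pseudoprobability convexity
up to any $\delta > \epsilon$ of $(\ell, \mathcal{P}, \mathcal{F})$ if we could replace $\mathcal{F}_d$ by $\mathcal{F}$, which is trivial if $\mathcal{F}_d = \mathcal{F}$ and
allowed under Assumption B because it implies that, for any $f \in \mathcal{F}_d$ we can find $f' \in \mathcal{F}$
such that $\mathbb{E}_{Z \sim P}[\ell_{f'}(Z)] - \mathbb{E}_{Z \sim P}[\ell_f(Z)] \le \delta - \epsilon$.

For the final implication, note that under Assumption A we can choose $\delta = \epsilon$, and by
Corollary 3.11 we can choose $\epsilon = 0$. ■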




4.3.4 The Central Condition Implies the Predictor Condition 


We proceed to study when the central condition implies the predictor condition (with $\mathcal{F}_d = \mathcal{F}$),
which requires the strongest assumptions among the implications we consider. We first
identify a minimax identity (32) that is sufficient by itself (Theorem 4.14), but difficult to
verify directly. We therefore weaken Theorem 4.14 to Theorem 4.17 by providing sufficient
conditions (Assumption D) for the minimax identity.

For any $\Pi$ and $\eta$, define the function

\[ S^\eta_\Pi(P, f) = \mathbb{E}_{Z \sim P}\, \mathbb{E}_{g \sim \Pi}\big[e^{\eta(\ell_f(Z) - \ell_g(Z))}\big], \]

which is the main quantity in the definitions of both the central and the predictor condition.


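Assumption C (Minimax Assumption) For given $\eta > 0$, we say that the $\eta$-minimax
assumption is satisfied for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ if, for all $\Pi \in \Delta(\mathcal{F})$ and for all $C \ge 1$, the following
implication holds:

\[ \sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}_d} S^\eta_\Pi(P, f) \le C \;\; \Rightarrow \;\; \inf_{f \in \mathcal{F}_d} \sup_{P \in \mathcal{P}} S^\eta_\Pi(P, f) \le C. \tag{31} \]

We call this the minimax assumption, because (31) is implied by the minimax identity

\[ \sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}_d} S^\eta_\Pi(P, f) = \inf_{f \in \mathcal{F}_d} \sup_{P \in \mathcal{P}} S^\eta_\Pi(P, f). \tag{32} \]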
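A finite toy table shows why (31) is a genuine assumption. The sup inf never exceeds the inf sup, but without convexity the two can differ, and then the implication can fail. In the sketch below the table of values is hypothetical and is not derived from any particular loss; rows play the role of $P$ and columns the role of $f$.

```python
def sup_inf(S):
    """Max over rows P of the min over columns f of S[P][f]."""
    return max(min(row) for row in S)

def inf_sup(S):
    """Min over columns f of the max over rows P of S[P][f]."""
    return min(max(row[j] for row in S) for j in range(len(S[0])))

# Hypothetical table of values S(P, f): rows index P, columns index f.
S = [
    [0.9, 1.4],
    [1.5, 0.8],
]
lower = sup_inf(S)   # each P has a good f of its own
upper = inf_sup(S)   # but no single f is good for every P
```

Here `lower` is 0.9 and `upper` is 1.4, so with $C = 1$ the premise of (31) holds while its conclusion fails. Convexity and compactness conditions of the kind collected in Assumption D are what rule out such gaps.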

Theorem 4.14 below implies that Assumption C is sufficient for the central condition to
imply the predictor condition, with $\mathcal{F}_d = \mathcal{F}$. Intuitively, Assumption C should hold under
broad conditions, just like standard minimax theorems hold under broad conditions.
Below we will identify the specific, less elegant but more easily verifiable Assumption D that
implies Assumption C. However, like conditions for standard minimax theorems, in some
cases Assumption D requires $\mathcal{F}_d \subset \mathbb{R}$ to be compact, yet we want to apply the theorem
also in cases where $\mathcal{F} = \mathbb{R}$. As shown in Example 4.21, in this case we can sometimes still
use Part (b) of the result, which implies that the assumption is still sufficient if we take
a smaller set $\mathcal{F}_d \subset \mathcal{F}$ that satisfies Assumption B. Note that Assumption B also played
a crucial role in going from stochastic mixability of $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ to the PPC condition for
$(\ell, \mathcal{P}, \mathcal{F})$.
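Theorem 4.14 Consider a decision problem $(\ell, \mathcal{P}, \mathcal{F})$. Suppose that $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ is such
that the $\eta$-minimax assumption (Assumption C) holds. Then

(a) if $\mathcal{F} = \mathcal{F}_d$ and the $\eta$-central condition holds up to some $\epsilon \ge 0$ for $(\ell, \mathcal{P}, \mathcal{F})$, then
the $\eta$-predictor condition holds up to any $\delta > \epsilon$ for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$. In particular, the weak
$\eta$-central condition implies the weak $\eta$-predictor condition. Moreover,

(b) if $\mathcal{F} \supset \mathcal{F}_d$ and $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ satisfies the strong version of Assumption B, then the
weak $\eta$-central condition for $(\ell, \mathcal{P}, \mathcal{F})$ implies the weak $\eta$-predictor condition for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$
and therefore also for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F})$.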

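Once we establish that the $\eta$-predictor condition holds for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ with $\mathcal{F}_d \subset \mathcal{F}$, by
Fact 4.2 we can also infer that the $\eta$-predictor condition holds for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d')$ for any
$\mathcal{F}_d' \supseteq \mathcal{F}_d$, in particular for $\mathcal{F}_d' = \mathcal{F}$.

Proof For Part (a), from the $\eta$-central condition up to $\epsilon$ and the fact that the sup inf never
exceeds the inf sup and that $\mathcal{F} = \mathcal{F}_d$ we get

\[ e^{\eta \epsilon} \ge \sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}_d} \sup_{\Pi \in \Delta(\mathcal{F})} S^\eta_\Pi(P, f) \ge \sup_{P \in \mathcal{P}} \sup_{\Pi \in \Delta(\mathcal{F})} \inf_{f \in \mathcal{F}_d} S^\eta_\Pi(P, f) = \sup_{\Pi \in \Delta(\mathcal{F})} \sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}_d} S^\eta_\Pi(P, f). \tag{33} \]

This establishes that the premise of (31) holds with $C = e^{\eta \epsilon}$ for all $\Pi \in \Delta(\mathcal{F})$. Hence
Assumption C tells us that the conclusion of (31) must also hold for all $\Pi \in \Delta(\mathcal{F})$, and
therefore

\[ \sup_{\Pi \in \Delta(\mathcal{F})} \inf_{f \in \mathcal{F}_d} \sup_{P \in \mathcal{P}} S^\eta_\Pi(P, f) \le e^{\eta \epsilon}. \]

Since we are not guaranteed that the infimum over $f$ is achieved, this implies the $\eta$-predictor
condition up to any $\delta > \epsilon$, but not necessarily for $\delta = \epsilon$. We thus obtain the first part of
the theorem.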

For Part (b), we note that, by the premise, Assumption A must hold and we can
apply Corollary 3.11, which tells us that for all $P \in \mathcal{P}$, the $f^*_P \in \mathcal{F}$ minimizing $R(P, f)$ is
essentially unique and that the strong $\eta$-central condition holds, i.e. for all $P \in \mathcal{P}$, (4) holds.
As explained below (4), this implies that $f_P = \phi(P)$ is $\mathcal{F}$-optimal for $P$, hence it follows
that $f_P = f^*_P$, $P$-almost surely. The strong version of Assumption B then implies that $\mathcal{F}_d$
contains a $g_P$ with $P(\ell_{f_P} = \ell_{g_P}) = 1$. We now have, by the strong $\eta$-central condition, that
for all $\Pi \in \Delta(\mathcal{F})$,

\[ 1 \ge \sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}} \sup_{\Pi \in \Delta(\mathcal{F})} S^\eta_\Pi(P, f) = \sup_{P \in \mathcal{P}} \sup_{\Pi \in \Delta(\mathcal{F})} S^\eta_\Pi(P, f_P) = \sup_{P \in \mathcal{P}} \sup_{\Pi \in \Delta(\mathcal{F})} S^\eta_\Pi(P, g_P) \ge \sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}_d} \sup_{\Pi \in \Delta(\mathcal{F})} S^\eta_\Pi(P, f). \]

We have thus established the first inequality of (33) with $\epsilon = 0$; we can now proceed as in
the first part. ■


We proceed to identify more concrete conditions that are sufficient for Assumption C.
To this end, we will endow the set of finite measures (including all probability measures) on
$\mathcal{Z}$ with the weak topology (Billingsley, 1968; Van der Vaart and Wellner, 1996), for which
convergence of a sequence of measures $P_1, P_2, \ldots$ to $P$ means that

\[ \mathbb{E}_{Z \sim P_n}[h(Z)] \to \mathbb{E}_{Z \sim P}[h(Z)] \tag{34} \]

for any bounded, continuous function $h : \mathcal{Z} \to \mathbb{R}$. To make continuity of $h$ well-defined, we
then also need to assume a topology on $\mathcal{Z}$. It is standard to assume that $\mathcal{Z}$ is a Polish space
(i.e. that it is a complete separable metric space), because then, from Prokhorov (1956),
there exists a metric for which the set of finite measures on $\mathcal{Z}$ is a Polish space as well
and for which convergence in this metric is equivalent to (34). The weak topology is the
topology induced by this metric.

We shall also assume that $\mathcal{P}$ is tight, which means that, for any $\epsilon > 0$, there must exist
a compact event $A \subseteq \mathcal{Z}$ such that $P(A) > 1 - \epsilon$ for all $P \in \mathcal{P}$. This is a weaker condition
than assuming that the whole space $\mathcal{Z}$ is compact because it allows some probability mass
outside of the compact event $A$.
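As an illustration of tightness on a non-compact outcome space, consider the unit-variance normal distributions with means in $[-M, M]$, the family that reappears in Example 4.21. The sketch below picks a single compact interval $A = [-a, a]$ that works for all of them; the particular choice of $a$ and the function names are assumptions made for the illustration, not taken from the paper.

```python
from statistics import NormalDist

def compact_interval_radius(M, eps):
    """Radius a such that N(mu, 1) puts mass > 1 - eps on [-a, a] whenever |mu| <= M."""
    # With q the (1 - eps/4)-quantile of N(0, 1), the interval [mu - q, mu + q]
    # has mass 1 - eps/2 under N(mu, 1) and is contained in [-(M + q), M + q].
    q = NormalDist().inv_cdf(1 - eps / 4)
    return M + q

def mass(mu, a):
    """Probability that N(mu, 1) assigns to the interval [-a, a]."""
    d = NormalDist(mu, 1.0)
    return d.cdf(a) - d.cdf(-a)

M, eps = 3.0, 0.01
a = compact_interval_radius(M, eps)
worst = min(mass(mu, a) for mu in [-M, -M / 2, 0.0, M / 2, M])
```

Since any mixture of these normals assigns to $A$ a weighted average of the component masses, the same interval also works for their convex hull.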

Assumption D Suppose the set of possible outcomes $\mathcal{Z}$ is a Polish space. Let $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$,
$\Pi \in \Delta(\mathcal{F})$ and $\eta > 0$ be given. Then assume that all of the following are satisfied:

1. For all $f \in \mathcal{F} \cup \mathcal{F}_d$, $\ell_f(z)$ is continuous in $z$ and $\ell_f(z) \ge 0$.

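2. The set $\mathcal{F}_d$ is convex and, for any $z \in \mathcal{Z}$, $e^{\eta \ell_f(z)}$ is convex in $f$ on $\mathcal{F}_d$.

3. The set $\mathcal{P}$ is convex and tight.

4. Either a) $\mathcal{P}$ is closed in the weak topology; or b) $\mathcal{F}_d$ is a totally bounded metric space,
and, for every compact subset $\mathcal{Z}'$ of $\mathcal{Z}$, the family of functions $\{f \mapsto \ell_f(z) : z \in \mathcal{Z}'\}$
is uniformly equicontinuous on $\mathcal{F}_d$.

5. The random variables $\xi_{Z,f} = \mathbb{E}_{g \sim \Pi}\big[e^{\eta(\ell_f(Z) - \ell_g(Z))}\big]$ are uniformly integrable over
$f \in \mathcal{F}_d$, $P \in \mathcal{P}$ in the sense that

\[ \lim_{b \to \infty} \sup_{f \in \mathcal{F}_d,\, P \in \mathcal{P}} \mathbb{E}_{Z \sim P}\big[\xi_{Z,f}\, \mathbf{1}\{\xi_{Z,f} > b\}\big] = 0. \tag{35} \]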


While these assumptions may look daunting, they actually hold in many situations even
with unbounded losses, as our examples below illustrate. In D.1, continuity is automatic
for finite and countable $\mathcal{Z}$ as long as we take the discrete topology. In D.2, convexity of
$e^{\eta \ell_f(z)}$ in $f$ is implied by convexity of $\ell_f(z)$ in $f$. Regarding the fourth requirement, D.4:
the condition that $\mathcal{P}$ is weakly closed is easily stated but hard to verify for general $\mathcal{Z}$ and
$\mathcal{P}$; the alternative condition is hard to state but often straightforward to verify. And finally,
D.5 will automatically hold for all bounded loss functions and for many unbounded losses
as well; for a discussion of uniform integrability as used in D.5, see Shiryaev (1996, pp. 188-190).
In particular, Lemma 3 on p. 190, specialised to our context, implies the following
sufficient condition:


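Lemma 4.15 (Sufficient Condition for D.5) For a fixed choice of $\Pi \in \Delta(\mathcal{F})$, let $\xi_{Z,f}$
be as in Assumption D.5. Then (35) is satisfied if

\[ \sup_{f \in \mathcal{F}_d} \sup_{P \in \mathcal{P}} \mathbb{E}_{Z \sim P}[G(\xi_{Z,f})] < \infty \]

for some function $G : [0, \infty) \to \mathbb{R}$ that is bounded below and is such that

\[ \frac{G(t)}{t} \text{ is increasing, and } \frac{G(t)}{t} \to \infty \text{ as } t \to \infty. \tag{36} \]

We may, for instance, take $G(t) = t^2$ or $G(t) = t \log t$.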

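Proof Without loss of generality, we may assume that $G$ is non-negative. Otherwise
replace $G(t)$ by $\max\{G(t), 0\}$, which preserves (36) and adds at most $-\inf_t G(t) < \infty$ to
$\sup_{f \in \mathcal{F}_d} \sup_{P \in \mathcal{P}} \mathbb{E}_{Z \sim P}[G(\xi_{Z,f})]$.

Now let $M = \sup_{f \in \mathcal{F}_d} \sup_{P \in \mathcal{P}} \mathbb{E}_{Z \sim P}[G(\xi_{Z,f})]$ and, for any $\epsilon > 0$, take $b > 0$ large
enough that $G(t)/t > M/\epsilon$ for all $t > b$. Then

\[ 0 \le \sup_{f \in \mathcal{F}_d} \sup_{P \in \mathcal{P}} \mathbb{E}_{Z \sim P}\big[\xi_{Z,f}\, \mathbf{1}\{\xi_{Z,f} > b\}\big] \le \frac{\epsilon}{M} \sup_{f \in \mathcal{F}_d} \sup_{P \in \mathcal{P}} \mathbb{E}_{Z \sim P}\big[G(\xi_{Z,f})\, \mathbf{1}\{\xi_{Z,f} > b\}\big] \le \epsilon, \]

from which (35) follows by letting $\epsilon$ tend to 0. ■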


Assumption D is sufficient for the minimax assumption, as our main technical result of
this section (proof deferred to Appendix A.2) shows:

Lemma 4.16 Fix $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ and $\eta > 0$. If Assumption D is satisfied for a given $\Pi \in
\Delta(\mathcal{F})$, then (32) also holds. Consequently, if Assumption D is satisfied for all $\Pi \in \Delta(\mathcal{F})$,
then that implies Assumption C.

Together, Theorem 4.14 and Lemma 4.16 prove the following theorem.

Theorem 4.17 (Central to Predictor) Let $\eta > 0$ and suppose Assumption D holds for
$(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ for all $\Pi \in \Delta(\mathcal{F})$. If either $\mathcal{F} = \mathcal{F}_d$ or the strong version of Assumption B
holds and $\mathcal{F} \supseteq \mathcal{F}_d$, then the weak $\eta$-central condition implies the weak $\eta$-predictor condition.

We now provide some examples which indicate that while Assumption D covers several
non-trivial cases (including non-compact $\mathcal{F}$), it is probably still significantly more restrictive
than needed.

Example 4.18 (Logarithmic Loss) Consider a set of distributions $\mathcal{P}$ on some set $\mathcal{Z}$ and
let $\mathcal{F}$ either be the densities or mass functions corresponding to $\mathcal{P}$ or an arbitrary convex
set of densities on $\mathcal{Z}$. By Example 2.2, $(\ell^{\log}, \mathcal{P}, \mathcal{F})$ satisfies the 1-central condition. If we
further assume that $\mathcal{P}$ is convex and tight and that there is a $\delta > 0$ such that for all $z \in \mathcal{Z}$,
all $f \in \mathcal{F}$, $f(z) \ge \delta$ (so that the densities are bounded from below), then Assumption D is
readily verified and we can conclude from the theorem that the 1-predictor condition and
hence 1-stochastic mixability holds for $(\ell^{\log}, \mathcal{P}, \mathcal{F}, \mathcal{F})$. We know however, because log-loss is
1-(Vovk-)mixable, that 1-stochastic mixability must even hold if $\mathcal{P}$ is neither convex nor
tight; Assumption D is not weak enough to handle this case, so the example suggests that a
further weakening might be possible. Also, we know that 1-stochastic mixability continues
to hold if $\delta = 0$; verification of Assumption D is not straightforward in this case, which
suggests that a simplification of the assumption is desirable. ■


Example 4.19 (0/1-Loss: Example 3.8, Continued) Consider the setting of Example 3.8
and Example 4.4 with decision problem $(\ell^{01}, \mathcal{P}_\delta, \mathcal{F})$ and $\delta > 0$. We established in
Example 3.8 that the $\eta$-central condition then holds for some $\eta > 0$, but also, in Example 4.4,
that $(\ell^{01}, \mathcal{P}_\delta, \mathcal{F}, \mathcal{F})$ is not $\eta$-stochastically mixable. We would thus expect Assumption D
to fail here, which it does, since $\mathcal{F} = \mathcal{F}_d$ is not convex. ■


Example 4.20 (Squared Loss, Restricted Domain) Let $\ell^{\mathrm{sq}}$ be the squared loss $\ell^{\mathrm{sq}}_f(z) :=
\frac{1}{2}(z - f)^2$ on the restricted spaces $\mathcal{Z} = \mathcal{F} = \mathcal{F}_d = [-B, B]$ as in Example 4.3, and take
$\mathcal{P}$ to be the set of all possible distributions on $\mathcal{Z}$. Then the first three requirements of
Assumption D may be verified by observing that $\ell^{\mathrm{sq}}_f(z)$ (and therefore also $e^{\eta \ell^{\mathrm{sq}}_f(z)}$) is convex
in $f$, and that $\mathcal{P}$ is trivially tight by taking $A = \mathcal{Z}$. Now $\mathcal{P}$ is actually closed in the weak
topology, but, in order to satisfy the fourth condition, we might also use that the mappings
$\{f \mapsto \ell^{\mathrm{sq}}_f(z) : z \in \mathcal{Z}\}$ are all Lipschitz with the same Lipschitz constant $(2B)$, which implies
that they are also uniformly equicontinuous. Finally, to see that the fifth requirement is
satisfied for any $\Pi \in \Delta(\mathcal{F})$, we may appeal to Lemma 4.15 with $G(t) = t^2$ and use that
$\xi_{Z,f}$ is uniformly bounded.

Then all parts of Assumption D are satisfied for all $\Pi \in \Delta(\mathcal{F})$. We know from Example 4.3
that in this case classical $\eta$-mixability holds for $\eta = 1/B^2$. This implies strong
$\eta$-stochastic mixability, which implies the strong $\eta$-pseudoprobability convexity condition
(by Proposition 4.12). Since Assumption A holds, this in turn implies the strong $\eta$-central
condition (by Theorem 3.10), and by applying Theorem 4.17 one can then infer the weak
$\eta$-predictor condition. ■


In the example above, the set $\mathcal{P}$ was convex and, by boundedness of $\mathcal{Z}$, automatically tight
and thus the $\eta$-central condition and $\eta$-stochastic mixability both hold. In Example 3.5
we established the $\eta$-central condition for a set $\mathcal{P}$ that is neither convex nor tight, so
Assumption D fails and we cannot apply Theorem 4.17 to jump from the $\eta$-central to the
$\eta$-predictor condition as in Example 4.20. However, as the next example shows, if we replace
$\mathcal{P}$ by its convex hull for a restricted range of $\mu$, then we can recover the predictor condition
via Theorem 4.17 after all; restriction of $\mathcal{F}$, however, is not needed.


Example 4.21 (Squared Loss, Unrestricted Domain: Example 3.5, Continued.)
Consider the squared loss $\ell_f(z) = \frac{1}{2}(z - f)^2$, and let $\mathcal{Z} = \mathbb{R}$, $\mathcal{F} = [-B, B]$ (later we will
consider $\mathcal{F} = \mathbb{R}$), and let $\mathcal{P} = \mathrm{co}(\{\mathcal{N}(\mu, 1) : \mu \in [-M, M]\})$ be the convex hull of the set of
normal distributions with unit variance and means bounded by $M < B$. We may represent
any $P \in \mathcal{P}$ as a mixture of $\mathcal{N}(\mu, 1)$ under some distribution $w$ on $\mu$. Let $\mu_P$ be the mean
of $P$. Then, for all $P \in \mathcal{P}$ with corresponding $w$ and all $t \in \mathbb{R}$,

$$\mathbb{E}_{Z \sim P}\bigl[e^{t(Z - \mu_P)}\bigr] = \int_{-M}^{M} \mathbb{E}_{Z \sim \mathcal{N}(\mu,1)}\bigl[e^{t(Z - \mu_P)}\bigr]\, dw(\mu) = e^{t^2/2} \int_{-M}^{M} e^{t(\mu - \mu_P)}\, dw(\mu) \le e^{t^2/2}\, e^{t^2 M^2/2},$$

where the last inequality follows from Hoeffding's bound on the moment generating function
and the observation that $\mu_P = \int_{-M}^{M} \mu \, dw(\mu)$. Thus the elements of $\mathcal{P}$ are all subgaussian with
variance $\sigma^2 = 1 + M^2$. Hence, by the argument in Example 3.6, the strong $\eta$-central
condition is satisfied for $\eta < 1/(1 + M^2)$ and with substitution function $\phi(P) = \mu_P$.

In order to also get the predictor condition via Theorem 4.17, we need to verify
Assumption D. The first three parts of this assumption may be readily verified, and part b) of
D.4 also holds, because the mappings $\{f \mapsto \frac{1}{2}(z - f)^2 : z \in [-A, A]\}$ are all $(2A)$-Lipschitz,
which implies their uniform equicontinuity, for any choice of $A$. Finally, Assumption D.5
follows from Lemma 4.15 with $G(t) = t^2$ and Jensen's inequality:


sup sup E 
f&T P&V 



2 

< sup sup E E 

g2r,(T/(Z)-€=pZ))' 


/ep PeP 

- 


< sup sup E 

f,gep 


' ^2g(q{Z)-Q(Z)) 


sup sup E 

f,g&P P&P^^P 


g2,?(/2+2Z(a-/)-g2) 


< sup sup E 

f,g&TP&P 


^iriZ(g-f) 


< sup sup e®’' i^+M )+4ri{g f)fip ^ ^ 

f,g&PP&P 
where (*) follows from $(1 + M^2)$-subgaussianity. Thus, Theorem 4.17 can be applied to
establish the weak $\eta$-predictor condition for squared loss on an unbounded domain $\mathcal{Z} = \mathbb{R}$
for the choices of $\eta$, $\mathcal{F}_d = \mathcal{F}$ and $\mathcal{P}$ described above.

Now consider the case where we set $\mathcal{F} = \mathbb{R} = \mathcal{Z}$ and leave everything else unchanged.
Then by the argument in Example 3.6, the strong $\eta$-central condition is still satisfied for
$\eta < 1/(1 + M^2)$, but we cannot directly use Theorem 4.17 to establish the weak predictor
condition for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F})$. All steps of the above reasoning go through except part b) of
D.4, since $\mathcal{F}$ is no longer compact. However, if we take $\mathcal{F}_d = [-B, B]$ for $B > M$, then
Assumption D.4 (which only refers to $\mathcal{F}_d$, not to $\mathcal{F}$) holds after all. Moreover, the strong
version of Assumption B also holds, because $\arg\min_{f \in \mathcal{F}} \mathbb{E}_{Z \sim P}(Z - f)^2 = \mu_P$. We can thus
use Theorem 4.17 to conclude that $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ satisfies the weak $\eta$-predictor condition. It
then follows by Fact 4.2 that $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F})$ satisfies the weak $\eta$-predictor condition as well.
We conclude that the implication $\eta$-central $\Rightarrow$ weak $\eta$-predictor goes through, even though
$\mathcal{F}$ is not compact. ■
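
The subgaussianity claim of Example 4.21 can be checked numerically for a finite mixture. The following is a minimal sketch, not part of the original argument: the component means and mixing weights are hypothetical, and the mixture is taken to be discrete so that the moment generating function is available in closed form.

```python
import math

def mixture_mgf_centered(t, means, weights):
    """Exact E[exp(t (Z - mu_P))] for Z ~ sum_i weights[i] * N(means[i], 1)."""
    mu_p = sum(w * m for w, m in zip(weights, means))
    # Each component N(m, 1) contributes exp(t (m - mu_P) + t^2 / 2).
    return sum(w * math.exp(t * (m - mu_p) + 0.5 * t * t)
               for w, m in zip(weights, means))

def subgaussian_bound(t, M):
    """Bound exp(t^2 (1 + M^2) / 2): unit variance plus Hoeffding's bound on the means."""
    return math.exp(0.5 * t * t * (1.0 + M * M))

M = 2.0
means = [-2.0, -0.5, 1.0, 2.0]   # hypothetical component means in [-M, M]
weights = [0.1, 0.4, 0.3, 0.2]   # hypothetical mixing weights (sum to 1)

ts = [k / 10.0 for k in range(-50, 51)]
bound_holds = all(
    mixture_mgf_centered(t, means, weights) <= subgaussian_bound(t, M) * (1 + 1e-12)
    for t in ts
)
print(bound_holds)
```

The check only covers one mixture and a grid of values of t; it illustrates the bound rather than proving it.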


This final example shows how Theorem 4.17 allows us to find assumptions on $\mathcal{P}$ that are
sufficient for establishing the weak predictor condition, and therefore weak stochastic
mixability, for squared loss on the unbounded domain $\mathbb{R}$. As discussed by Vovk (2001, Section 5),
this is a case where the classical mixability analysis does not apply.

5. Intermediate Rates: The Central Condition, the Margin Condition 
and the Bernstein condition 

In this section, we weaken the $\eta$-central and $\eta$-PPC conditions to the $v$-central and
$v$-PPC conditions, which allow $\eta = v(\epsilon)$ to depend on $\epsilon$ according to a function $v$ that is
allowed to go to 0 as $\epsilon$ goes to 0. In the main result of this section, Theorem 5.4 in
Section 5.1, we establish that for bounded loss functions, these weakened versions of our
conditions are essentially equivalent to a generalized Bernstein condition which has been
used before to characterize fast rates. Section 5.2 shows that, for unbounded loss functions,
the one-sidedness of our conditions allows them to capture situations in which fast rates are
attainable yet the Bernstein condition does not hold. There are also situations in
which the Bernstein condition holds whereas the $v$-central condition does not hold for any allowed
$v$ (although the $v$-PPC condition does). Thus, as a corollary we find that the equivalence
between the central and PPC condition breaks for the weaker $v$-versions of these conditions.
Section 5.3 illustrates that $\eta$-stochastic mixability can be weakened similarly to $v$-stochastic
mixability and relates this to a condition identified by Juditsky et al. (2008). Finally, in
Section 5.4 we apply Theorem 5.4 to show how the central condition is related to
(non-)existence of unique risk minimizers.

5.1 The $v$-Conditions and the Bernstein Condition

Empirical risk minimization (ERM) achieves fast rates if the random deviations of the
empirical excess risk are small compared to the true excess risk. As shown by Tsybakov
(2004), this is the case in classification if the Bayes-optimal classifier is in the model $\mathcal{F}$ and
the so-called margin, which measures the difference between the conditional probabilities of
the labels given the features and the uniform distribution, is large. Technically, the random
deviations can be controlled in this case, because the second moment of the excess loss can
be bounded in terms of the first moment. In fact, as shown by Bartlett and Mendelson
(2006), this condition, which they call the Bernstein condition, is sufficient for fast rates for
bounded losses in general, even if the Bayes-optimal decision is not in the model. Precisely,
the standard Bernstein condition is defined as follows:


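Definition 5.1 (Bernstein Condition) Let $\beta \in (0,1]$ and $B \ge 1$. Then $(\ell, P, \mathcal{F})$
satisfies the $(\beta, B)$-Bernstein condition if there exists an $f^* \in \mathcal{F}$ such that

$$\mathbb{E}_{Z \sim P}\bigl[(\ell_f(Z) - \ell_{f^*}(Z))^2\bigr] \le B \bigl(\mathbb{E}_{Z \sim P}[\ell_f(Z) - \ell_{f^*}(Z)]\bigr)^{\beta} \quad \text{for all } f \in \mathcal{F}. \tag{37}$$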


This standard definition bounds the second moment in terms of the polynomial function
$u(x) = Bx^\beta$ of the first moment.⁶ The exponent $\beta$ is most important, because it determines
the order of the rates, whereas the scaling factor $B$ only matters for the constants. To
draw the connection with the central condition, however, it will be clearer to allow general
functions $u$ instead of $x \mapsto Bx^\beta$. Following Koltchinskii (2006) and Arlot and Bartlett
(2011), we then bound the variance instead of the second moment, which is equivalent with
respect to the rates that can be obtained:

Definition 5.2 (Generalized Bernstein Condition) Let $u : [0, \infty) \to [0, \infty)$ be a
non-decreasing function such that $u(x) > 0$ for all $x > 0$, and $u(x)/x$ is non-increasing. We say
that $(\ell, \mathcal{P}, \mathcal{F})$ satisfies the $u$-Bernstein condition if, for all $P \in \mathcal{P}$, there exists an $\mathcal{F}$-optimal
$f^* \in \mathcal{F}$ (satisfying (3)) such that

$$\mathrm{Var}_{Z \sim P}\bigl(\ell_f(Z) - \ell_{f^*}(Z)\bigr) \le u\bigl(\mathbb{E}_{Z \sim P}[\ell_f(Z) - \ell_{f^*}(Z)]\bigr) \quad \text{for all } f \in \mathcal{F}. \tag{38}$$

In particular $u(x) = Bx^\beta$ is allowed for $\beta \in [0,1]$, or, more generally, it is sufficient if
$u(0) = 0$ and $u$ is a non-decreasing concave function, because then the slope $u(x)/x =
(u(x) - u(0))/x$ is non-increasing; for a concrete example see Example 5.5 below.
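
The requirements that Definition 5.2 places on $u$ are easy to test on a grid. The sketch below is illustrative only: the helper name `satisfies_definition_5_2`, the grid, and the constants $B = 3$, $\beta = 1/2$ are assumptions made for the example. It checks the polynomial case, the piecewise function from footnote 7, and $u(x) = x^2$ as a function that violates the slope requirement.

```python
import math

def satisfies_definition_5_2(u, xs, tol=1e-12):
    """Grid check: u(x) > 0 for x > 0, u non-decreasing, and u(x)/x non-increasing."""
    xs = sorted(x for x in xs if x > 0)
    vals = [u(x) for x in xs]
    positive = all(v > 0 for v in vals)
    nondecreasing = all(b >= a - tol for a, b in zip(vals, vals[1:]))
    slopes = [v / x for v, x in zip(vals, xs)]
    slope_nonincreasing = all(b <= a + tol for a, b in zip(slopes, slopes[1:]))
    return positive and nondecreasing and slope_nonincreasing

def power_u(x, B=3.0, beta=0.5):
    """u(x) = B x^beta with hypothetical constants B and beta."""
    return B * x ** beta

def footnote_u(x):
    """Piecewise example from footnote 7."""
    return (x - 1.0 / 3.0) ** 3 + 1.0 / 27.0 if x <= 0.5 else x / 12.0

def square_u(x):
    """u(x) = x^2: the slope u(x)/x = x is increasing, so the definition fails."""
    return x * x

grid = [k / 1000.0 for k in range(1, 2001)]   # x in (0, 2]
print(satisfies_definition_5_2(power_u, grid))
print(satisfies_definition_5_2(footnote_u, grid))
print(satisfies_definition_5_2(square_u, grid))

# sqrt(u) is not concave for the footnote example: the midpoint value lies below the chord.
mid = math.sqrt(footnote_u(0.45))
chord = 0.5 * (math.sqrt(footnote_u(0.4)) + math.sqrt(footnote_u(0.5)))
print(mid < chord)
```

A grid check can only refute the requirements, not establish them; for the footnote example the slope is $u(x)/x = x^2 - x + 1/3$ on $(0, 1/2]$, which is decreasing there.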

Similar generalizations have been proposed by Koltchinskii (2006) and Arlot and Bartlett
(2011).⁷ For bounded losses, our generalized Bernstein condition is equivalent to a
generalization of the central condition in which $\eta = v(\epsilon)$ is allowed to depend on $\epsilon$ according
to some function $v$, which in turn is equivalent to the analogous generalization of the
pseudoprobability-convexity condition. We first introduce these generalized concepts and
then show how they are related to the Bernstein condition. They are defined as immediate
generalizations of their corresponding definitions, Definition 3.1, Equation (12) and
Definition 3.2, Equation (15):

6. The Tsybakov condition with exponent $q$ (Tsybakov, 2004) is the special case that the $(\beta, B)$-Bernstein
condition holds for $B < \infty$, $q = \beta/(1 - \beta)$, additionally requiring $\ell$ to be classification loss and $\mathcal{F}$ to
contain the Bayes classifier for $P$.

7. They require $u$ to be of the form $w^2$ where $w$ is a concave increasing function with $w(0) = 0$. In their
examples, $u$ is also concave, a case which is subsumed by our condition, but they additionally allow
concave $w$ with convex $u = w^2$, which is not covered by our condition. On the other hand, our condition
allows $u$ with non-concave $\sqrt{u}$, which is not covered by theirs. For example, $u(x) = (x - 1/3)^3 + 1/27$ for
$x \le 1/2$ and $u(x) = x/12$ for $x > 1/2$ satisfies our condition, but $\sqrt{u(x)}$ is nonconcave. So, in general,
the conditions are incomparable.

Definition 5.3 ($v$-Central Condition and $v$-PPC Condition) Let $v : [0, \infty) \to [0, \infty)$
be a bounded, non-decreasing function satisfying $v(x) > 0$ for all $x > 0$. We say that
$(\ell, \mathcal{P}, \mathcal{F})$ satisfies the $v$-central condition if, for all $\epsilon \ge 0$, there exists a function
$\phi : \mathcal{P} \to \mathcal{F}$ such that (12) is satisfied with $\eta = v(\epsilon)$. We say that $(\ell, \mathcal{P}, \mathcal{F})$ satisfies the
$v$-pseudoprobability convexity (PPC) condition if, for all $\epsilon \ge 0$, there exists a function
$\phi : \mathcal{P} \to \mathcal{F}$ such that (15) is satisfied with $\eta = v(\epsilon)$.

If $v(x) = \eta$ for all $x > 0$ and $v(0) = 0$, then the $v$-central condition is equivalent to the weak
$\eta$-central condition. If $v(x) = \eta$ for all $x \ge 0$, then it is equivalent to the strong $\eta$-central
condition.

Now consider a decision problem $(\ell, \mathcal{P}, \mathcal{F})$ such that Assumption A holds. Theorem 5.4
below in conjunction with Proposition 3.9 implies that the generalized Bernstein condition
with function $u$, the $v$-central condition and the $v$-PPC condition are then all equivalent
for bounded losses in the sense that one implies the other if

$$v(x) \cdot u(x) = c \cdot x \quad \text{for all sufficiently small } x, \tag{39}$$

where $c$ is a constant whose value depends on whether we are going from Bernstein to central
or the other way around. In particular, if we ignore the unimportant difference between
the second moment of $\ell_f(Z) - \ell_{f^*}(Z)$ and its variance, we see that the $(1, B)$-Bernstein
condition and the $\eta$-central condition are equivalent for $\eta = c/B$.

Define the function $\kappa(x) := (e^x - x - 1)/x^2$ for $x \ne 0$, extended by continuity to
$\kappa(0) = 1/2$, which is positive and increasing (Freedman, 1975).

Theorem 5.4 For given $(\ell, \mathcal{P}, \mathcal{F})$, suppose that the losses $\ell_f$ take values in $[0, a]$.

1. If the $u$-Bernstein condition holds for a function $u$ satisfying the requirements of
Definition 5.2 (so that Assumption A holds), then

(a) The $v$-central condition holds for

$$v(x) = \frac{c_1 x}{u(x)} \wedge b,$$

where $b > 0$ can be any finite constant and $c_1 = 1/\kappa(2ba)$; and if $u(0) = 0$ we
read $0/u(0)$ as $\liminf_{x \downarrow 0} x/u(x)$.

(b) Additionally, for each $P \in \mathcal{P}$, any $\mathcal{F}$-optimal $f^*$ for $P$, and any $\delta > 0$, we have

$$\mathbb{E}_{Z \sim P}\bigl[e^{\eta(\ell_{f^*}(Z) - \ell_f(Z))}\bigr] \le 1 \quad \text{for all } f \text{ with } R(P, f) - R(P, f^*) \ge \delta, \text{ where } \eta = v(\delta).$$

2. On the other hand, suppose that Assumption A holds. If the $v$-pseudoprobability
convexity condition holds for a function $v$ satisfying the requirements of Definition 5.3
such that $x/v(x)$ is nondecreasing, then the $u$-Bernstein condition holds for

$$u(x) = \frac{c_2 x}{v(x)},$$

where $c_2 = 6/\kappa(-2ba)$ for $b = \sup_x v(x) < \infty$; and if $v(0) = 0$ we read $0/v(0)$ as
$\lim_{x \downarrow 0} x/v(x)$.

We are mainly interested in Part 1(a) of the theorem and its essential converse, Part 2. Part
1(b) is a by-product of the proof of 1(a) that will be useful for the proof of Proposition 5.11
below as well as the proof of the later-appearing Corollary 7.8. Part 2 assumes that the
$v$-PPC condition holds for $v$ such that $\sup_{x \ge 0} v(x) < \infty$. This boundedness requirement
is without essential loss of generality, since we already assume that losses are in $[0, a]$.
From the definition this trivially implies that, if the $v$-condition holds at all, then also the
$v'$-condition holds for $v'(x) = v(x) \wedge a'$, for any $a' > a$.
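
To make the conversion in Part 1(a) concrete, the following sketch computes $v$ from a given $u$ under the reading $v(x) = (c_1 x / u(x)) \wedge b$ with $c_1 = 1/\kappa(2ba)$. The Bernstein parameters, the loss range $a = 1$ and the cap $b = 1$ are hypothetical values chosen only for illustration.

```python
import math

def kappa(x):
    """kappa(x) = (e^x - x - 1) / x^2, extended by continuity to kappa(0) = 1/2."""
    if abs(x) < 1e-6:
        return 0.5 + x / 6.0          # first-order Taylor expansion near 0
    return (math.exp(x) - x - 1.0) / (x * x)

def v_from_u(u, a, b):
    """Return (v, c1) with v(x) = min(c1 * x / u(x), b) for x > 0 and c1 = 1 / kappa(2 b a)."""
    c1 = 1.0 / kappa(2.0 * b * a)
    def v(x):
        return min(c1 * x / u(x), b)
    return v, c1

B, beta = 2.0, 0.5                    # hypothetical Bernstein parameters
u = lambda x: B * x ** beta
v, c1 = v_from_u(u, a=1.0, b=1.0)     # losses in [0, 1], cap b = 1 (illustrative)

for x in (1e-4, 1e-3, 1e-2):
    # While the cap b is inactive, v(x) * u(x) / x equals c1, which is relation (39).
    print(x, v(x), v(x) * u(x) / x)
```

For $u(x) = Bx^\beta$ this gives $v(x)$ proportional to $x^{1-\beta}$ near 0, the special case discussed in Example 5.5.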

Example 5.5 (Example 2.3 and 3.8, Continued) Let $\ell$ be a bounded loss function and
suppose that the $u$-Bernstein condition holds with $u(x) = Bx^\beta$ for some $\beta \in [0,1]$. We first
note that if $\beta = 0$, then the condition holds trivially for large enough $B$. Theorem 5.4 shows
that, in this case, we have the $v$-central condition for some $v$ being linear in a neighborhood
of 0, in particular $\liminf_{x \downarrow 0} v(x)/x < \infty$. Thus, for bounded losses, the $v$-central condition
always holds for such $v$. We will therefore say that the $v$-central condition holds nontrivially
if it holds for $v$ with $\liminf_{x \downarrow 0} v(x)/x = \infty$. Since the trivial $v$-condition always holds,
it provides no information and therefore, under this condition, one can only prove (using
Hoeffding's inequality) the standard slow rate of $O(1/\sqrt{n})$. The other extreme is when we
have the $\eta$-central condition, i.e. the $v$-condition holds with constant $v$, which as we show
in Theorem 7.6 leads to rates of order $O(1/n)$. Moreover, as we show in Corollary 7.8, it
also is possible to recover intermediate rates under the general case of the $v$-central
condition. Specifically, under the $v$-central condition, we get in-probability rates of $O(w(1/n))$,
where we recall that $w$ is the inverse of the function $x \mapsto x v(x)$. In the special case of
$v : \epsilon \mapsto \epsilon^{1-\beta}$ (for which the behavior in terms of $\epsilon$ corresponds to the $(\beta, B)$-Bernstein
condition as shown by Theorem 5.4), we get the rate $O(n^{-1/(2-\beta)})$, just as we do from the
$(\beta, B)$-Bernstein condition. ■
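
The rate $w(1/n)$ in Example 5.5 can be evaluated numerically by inverting $x \mapsto x v(x)$. The sketch below is an illustration under stated assumptions: $v(\epsilon) = \epsilon^{1-\beta}$ is restricted to $[0, 1]$ so that it is bounded, the inverse is found by bisection, and $\beta = 1/2$ and $n = 10000$ are arbitrary choices.

```python
def inverse_of_x_times_v(v, y, lo=0.0, hi=1.0, iters=200):
    """Bisection for w(y): the x in [lo, hi] with x * v(x) = y, assuming x * v(x) is increasing."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid * v(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

beta = 0.5                                   # hypothetical Bernstein exponent
v = lambda eps: eps ** (1.0 - beta)          # v(eps) = eps^(1 - beta) on [0, 1]
n = 10_000

w_numeric = inverse_of_x_times_v(v, 1.0 / n)
w_closed = n ** (-1.0 / (2.0 - beta))        # closed form n^(-1 / (2 - beta))
print(w_numeric, w_closed)
```

With $\beta = 1$ the same computation gives $w(1/n) = 1/n$, and with $\beta = 0$ it gives $1/\sqrt{n}$, matching the two extremes described in the example.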


The proof of Theorem 5.4 is deferred until Appendix A.3. It is based on the following
lemma, which adds a (non-surprising) lower bound to a well-known upper bound used e.g.
by Freedman (1975) in the context of concentration inequalities. Since most authors only
require the upper bound, we have been unable to find a reference for the lower bound,
except for Lemma C.4 in our own work (Koolen et al., 2014). Interestingly, the lemma is
applied in the proof of Theorem 5.4 with a 'frequentist' expectation over $Z \in \mathcal{Z}$ to prove
the first part, and a 'Bayesian' expectation over $f \in \mathcal{F}$ to prove the second part.

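Lemma 5.6 For any random variable $X$ taking values in $[-a, a]$,

$$\kappa(-2a) \,\mathrm{Var}(X) \le \mathbb{E}[X] + \log \mathbb{E}[e^{-X}] \le \kappa(2a) \,\mathrm{Var}(X), \tag{40}$$

where the function $\kappa$ is as defined above Theorem 5.4.

Proof Define the auxiliary function $\kappa'(x) = e^x - x - 1$. Then

$$\mathbb{E}[X] + \log \mathbb{E}[e^{-X}] = \min_{\mu} \mathbb{E}[\kappa'(\mu - X)],$$

as may be checked by observing that $\mathbb{E}[\kappa'(\mu - X)] = \mathbb{E}[e^{\mu - X}] - \mu + \mathbb{E}[X] - 1$ is minimized
at $\mu = -\log \mathbb{E}[e^{-X}]$. Since $\kappa'(x) = \kappa(x) x^2$ and $\kappa(x)$ is increasing (Freedman, 1975), we
further have, for $\mu \in [-a, a]$,

$$\mathbb{E}[\kappa'(\mu - X)] \begin{cases} \le \max_{\mu, x \in [-a,a]} \kappa(\mu - x)\, \mathbb{E}[(\mu - X)^2] = \kappa(2a)\, \mathbb{E}[(\mu - X)^2], \\ \ge \min_{\mu, x \in [-a,a]} \kappa(\mu - x)\, \mathbb{E}[(\mu - X)^2] = \kappa(-2a)\, \mathbb{E}[(\mu - X)^2], \end{cases} \tag{41}$$

from which the lemma follows upon observing that $\min_{\mu \in [-a,a]} \mathbb{E}[(\mu - X)^2] = \mathrm{Var}(X)$. ■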
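
Inequality (40) can be sanity-checked on randomly generated bounded random variables. The following is a small numerical sketch, not a substitute for the proof: the distributions are finite and drawn at random, and the seed, number of trials and support sizes are arbitrary choices.

```python
import math
import random

def kappa(x):
    """kappa(x) = (e^x - x - 1) / x^2, extended by continuity to kappa(0) = 1/2."""
    if abs(x) < 1e-6:
        return 0.5 + x / 6.0
    return (math.exp(x) - x - 1.0) / (x * x)

def lemma_terms(values, probs):
    """Return (Var(X), E[X] + log E[exp(-X)]) for a finite distribution."""
    mean = sum(p * x for p, x in zip(probs, values))
    var = sum(p * (x - mean) ** 2 for p, x in zip(probs, values))
    middle = mean + math.log(sum(p * math.exp(-x) for p, x in zip(probs, values)))
    return var, middle

def check_lemma_5_6(a, trials=1000, seed=0, slack=1e-12):
    """Check kappa(-2a) Var <= E[X] + log E[e^-X] <= kappa(2a) Var on random X in [-a, a]."""
    rng = random.Random(seed)
    for _ in range(trials):
        k = rng.randint(2, 6)
        values = [rng.uniform(-a, a) for _ in range(k)]
        raw = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(raw)
        probs = [r / total for r in raw]
        var, middle = lemma_terms(values, probs)
        if not (kappa(-2 * a) * var - slack <= middle <= kappa(2 * a) * var + slack):
            return False
    return True

print(check_lemma_5_6(1.5), check_lemma_5_6(0.1))
```

For small $a$ both constants $\kappa(\pm 2a)$ approach $1/2$, so the middle term is close to $\mathrm{Var}(X)/2$.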


5.2 Bernstein vs. Central Condition for Unbounded Losses - Two-sided vs.
One-sided Conditions

Applying Proposition 3.9 with $\eta = v(\epsilon)$ for all $\epsilon \ge 0$ immediately gives that, under no
further assumptions, the $v$-central condition implies the $v$-pseudoprobability convexity
condition. Combined with Theorem 5.4 this shows that the central condition and the Bernstein
condition are essentially equivalent for bounded losses, so it is natural to ask how the
$v$-versions of our conditions are related to the Bernstein conditions for unbounded losses. In
that case there are two essential differences. One difference is that the variance or second
moment in the Bernstein condition is two-sided in the sense that it is large not only if the
excess loss $\ell_f(Z) - \ell_{f^*}(Z)$ gets largely negative with significant probability, but also if the
excess loss is large, whereas the central condition is one-sided in that large excess losses
only make it easier to satisfy. This difference is illustrated by Example 5.7 below, where
fast rates can be obtained and the central condition holds, but the Bernstein condition fails
to be satisfied. The second difference is that the $v$-central condition essentially requires
the probability that $\ell_{f^*}(Z) - \ell_f(Z)$ is large to be exponentially small. Hence, if the loss is
unbounded and has only polynomial tails, then the $v$-central condition cannot hold. Yet
Example 5.8 shows that in such a case, the $u$-Bernstein condition can very well hold for
nontrivial $u$. However, we should note that the $v$-PPC condition and the $v$-stochastic
mixability conditions (introduced in the next subsection) also do not require exponential tails;
hence it may still be that whenever the $u$-Bernstein condition holds, $v$-stochastic mixability
also holds with $u(x) \cdot v(x) \asymp x$; we do not know whether this is the case.

Example 5.7 (Central without Bernstein for Unbounded Loss) Consider density
estimation for the log loss. Let $f_\mu$ denote the univariate normal density with mean $\mu$ and
variance 1, let $\mathcal{P}$ be the normal location family and let $\mathcal{F} = \{f_\mu : \mu \in \mathbb{R}\}$ be the set of densities of
the distributions in $\mathcal{P}$. Then, for any $P \in \mathcal{P}$ with density $f_\nu$, the risk $R(P, f)$ is minimized
by $f^* = f_\nu$, since the model is well-specified.

Let $Z_1, \ldots, Z_n$ be an iid sample from $P \in \mathcal{P}$. Then, as can be verified by direct
calculation, the empirical risk minimizer/maximum likelihood estimator relative to $\mathcal{F}$,
$\hat{\mu} := \frac{1}{n} \sum_{i=1}^n Z_i$, satisfies $\mathbb{E}[(\hat{\mu} - \nu)^2] = 1/n$, which translates into an expected
excess risk of

$$\mathbb{E}\bigl[R(P, f_{\hat{\mu}}) - R(P, f_\nu)\bigr] = \tfrac{1}{2}\, \mathbb{E}[(\hat{\mu} - \nu)^2] = \frac{1}{2n},$$

such that ERM obtains a fast rate in expectation. One would therefore want a condition that
aims to capture fast rates to be satisfied as well. For the central condition, this is the case
with $\eta = 1$, as follows from Example 2.2. However, as we show next, the $(1, B)$-Bernstein
condition does not hold for any constant $B$.

Consider $P \in \mathcal{P}$ with density $f_\nu$, and abbreviate $U_\mu(z) = -\log f_\mu(z) + \log f_\nu(z) =
\frac{1}{2}(\mu^2 - \nu^2) + z(\nu - \mu)$. Then

$$\mathbb{E}_{Z \sim P}\bigl[U_\mu(Z)^2\bigr] = \Bigl(\frac{\mu^2 - \nu^2}{2}\Bigr)^2 + (\nu - \mu)^2\, \mathbb{E}_{Z \sim P}[Z^2] + 2(\nu - \mu)\, \frac{\mu^2 - \nu^2}{2}\, \mathbb{E}_{Z \sim P}[Z]$$
$$= (\nu - \mu)^2 (1 + \nu^2) + (\nu - \mu)\, \nu\, (\mu^2 - \nu^2) + \Bigl(\frac{\mu^2 - \nu^2}{2}\Bigr)^2,$$

and $\mathbb{E}_{Z \sim P}[U_\mu(Z)] = \frac{1}{2}(\mu^2 - \nu^2) + \nu(\nu - \mu) = \frac{1}{2}(\mu - \nu)^2$.

First consider the case that the 'true' mean $\nu \ge 0$. Then for all constants $B$ the
$(1, B)$-Bernstein condition fails to hold. To see this, first observe that for any $\mu$ satisfying $\mu < 0$,
$\mathbb{E}_{Z \sim P}[U_\mu(Z)^2] \ge \bigl(\frac{\mu^2 - \nu^2}{2}\bigr)^2$, since $\nu - \mu > 0$ and $\nu \ge 0$: the first two terms above sum to
$(\nu - \mu)^2 (1 - \nu\mu)$, which is nonnegative. Second, observe
that $\mathbb{E}_{Z \sim P}[U_\mu(Z)] \le \mu^2 + \nu^2$, since $-\nu\mu \le \frac{\mu^2 + \nu^2}{2}$. Hence, the following condition is weaker
than the $(1, B)$-Bernstein condition:

$$(\mu^2 - \nu^2)^2 \le 4B (\mu^2 + \nu^2).$$

Choosing $\mu$ to satisfy $\nu^2 \le \mu^2/2$ leads to the even weaker condition $(\mu^2/2)^2 \le 4B(2\mu^2)$, which
fails as soon as $|\mu| > \sqrt{32B}$. It remains to show that the $(1, B)$-Bernstein condition also fails to hold
for all $B$ if the true mean $\nu < 0$; this is shown using a symmetric argument by considering
$\mu > 0$ with $\nu^2 \le \mu^2/2$. The result follows. ■
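
The failure of the $(1, B)$-Bernstein condition in Example 5.7 shows up directly in the ratio of the second to the first moment of the excess loss, which any admissible $B$ would have to dominate. The sketch below evaluates this ratio exactly from $\mathbb{E}[Z] = \nu$ and $\mathbb{E}[Z^2] = 1 + \nu^2$; the value $\nu = 1$ and the grid of $\mu$ values are illustrative choices.

```python
def excess_loss_moments(mu, nu):
    """First and second moments of U_mu(Z) = (mu^2 - nu^2)/2 + Z (nu - mu) for Z ~ N(nu, 1)."""
    c = 0.5 * (mu * mu - nu * nu)      # constant term of U_mu
    s = nu - mu                        # coefficient of Z in U_mu
    ez, ez2 = nu, 1.0 + nu * nu        # E[Z] and E[Z^2] under N(nu, 1)
    first = c + s * ez
    second = c * c + s * s * ez2 + 2.0 * c * s * ez
    return first, second

nu = 1.0                               # hypothetical true mean
ratios = []
for mu in (-1.0, -10.0, -100.0):
    first, second = excess_loss_moments(mu, nu)
    ratios.append(second / first)      # a (1, B)-Bernstein constant must be at least this
    print(mu, first, second, second / first)
```

Algebraically the ratio equals $2(1 - \nu\mu) + (\mu + \nu)^2/2$, which grows without bound as $\mu \to -\infty$, so no finite $B$ works.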


Critically, the Bernstein condition cannot hold because of the two-sided nature of the second
moment, which is large, not just if some $f$ is better than $f^*$ with significant probability,
but also if it is much worse. Thus, the fact that certain $f$ are so highly suboptimal that
they suffer high empirical excess risk with high probability (and hence are easily avoided by
ERM) ironically is what causes the Bernstein condition to fail; a related point is made by
Mendelson (2014). The next example shows that, if $Z$ has two-sided, polynomial tails then
the opposite phenomenon can also occur: the $v$-central condition does not hold for any $v$,
but we do have the $u$-Bernstein condition for nontrivial $u$.


Example 5.8 Let $\mathcal{P}$ be an arbitrary collection of distributions over $\mathbb{R}$ such that for all
$P \in \mathcal{P}$, the mean $\mu_P := \mathbb{E}_{Z \sim P}[Z] \in [-1, 1]$. Consider the squared loss $\ell_f(z) = (z - f)^2$,
with $\mathcal{F} = [-1, 1]$. Assume that $\mathcal{P}$ contains a distribution $P^*$ with $\mu_{P^*} = 0$ and, for some
constants $c_1, c_2 > 0$, for all $z \in \mathbb{R}$ with $|z| > c_1$, the density $p^*$ of $P^*$ satisfies $p^*(z) \ge c_2/z^6$.
The predictor in $\mathcal{F}$ that minimizes risk is given by $f^* = 0$. Now with such a $\mathcal{P}$, for all
$\eta > 0$, all $\mu \ne 0$, and using that $\ell_{f^*}(Z) - \ell_\mu(Z) = 2Z\mu - \mu^2$, we find for $c_3 = c_2 \cdot \exp(-\eta\mu^2)$,

$$\mathbb{E}_{Z \sim P^*}\bigl[e^{\eta(\ell_{f^*}(Z) - \ell_\mu(Z))}\bigr] \ge c_3 \int_{|z| > c_1} \frac{e^{2\eta\mu z}}{z^6}\, dz = \infty, \tag{42}$$

so that the $v$-central condition fails for all $v$ of the form required in Definition 5.3. Hence
the $v$-central condition does not hold, although from Example 5.10 below we see that
$v$-stochastic mixability (and hence the $v$-PPC condition) does hold for $v(x) \asymp \sqrt{x}$.

Now consider a $\mathcal{P}$ with means in $[-1, 1]$ and containing a $P^*$ as above such that
additionally for all $P \in \mathcal{P}$, the fourth moment is uniformly bounded, i.e. there is an $A > 0$
such that for all $P \in \mathcal{P}$, $\mathbb{E}_{Z \sim P}[Z^4] \le A$. Clearly we can construct such a $\mathcal{P}$ and by the
above it will not satisfy the $v$-central condition for any allowed $v$. However, the $u$-Bernstein
condition holds with $u(x) = (4\sqrt{A} + 1)x$, since, using again $\ell_f(Z) - \ell_{f^*}(Z) = -2Z\mu + \mu^2$,
we find

$$\mathbb{E}\bigl[(\ell_f(Z) - \ell_{f^*}(Z))^2\bigr] = \mathbb{E}\bigl[\mu^4 + 4Z^2\mu^2 - 4Z\mu^3\bigr] \le 4\sqrt{A}\,\mu^2 + \mu^4 \le u(\mu^2). \qquad ■$$


5.3 $v$-Stochastic Mixability and the JRT Conditions

Just as Definition 5.3 weakened the $\eta$-central and PPC conditions to the $v$-central and PPC
conditions, we similarly may weaken the main conditions of Section 4, stochastic mixability
and its special case stochastic exp-concavity, to their $v$-versions:

Definition 5.9 ($v$-Stochastic Mixability and $v$-Stochastic Exp-Concavity) Let
$v : [0, \infty) \to [0, \infty)$ be a bounded, non-decreasing function satisfying $v(x) > 0$ for all $x > 0$.
We say that $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ is $v$-stochastically mixable if, for all $\epsilon \ge 0$, there exists a function
$\psi : \Delta(\mathcal{F}) \to \mathcal{F}_d$ such that (22) is satisfied with $\eta = v(\epsilon)$. If $\mathcal{F}_d \supseteq \mathrm{co}(\mathcal{F})$ and this holds for the
function $\psi(\Pi) = \mathbb{E}_{f \sim \Pi}[f]$ for all $\epsilon \ge 0$, then we say that $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ is
$v$-stochastically exp-concave.

The main insight of Sections 3 and 4 was that the $\eta$-central condition, $\eta$-PPC condition
and $\eta$-stochastic mixability are all equivalent under some assumptions. One may of course
conjecture that the same holds for their weaker $v$-versions. We shall defer discussion of this
issue to Section 8 and for now focus on the usefulness of $v$-stochastic exp-concavity, which
can lead to intermediate rates even for unbounded losses.

A special case of $v$-stochastic exp-concavity, which we will call the JRT-I condition,
was stated by Juditsky et al. (2008); recall that we discussed the JRT-II condition in
Section 4.2.3. The JRT-I condition⁸ states that, for every $\eta > 0$, the excess loss can be
decomposed as

$$\ell_f(z) - \ell_{f^*}(z) \ge \ell^{(\eta)}(z, f, f^*) - r_\eta(z) \quad \text{for all } z, \text{ any } f, f^* \in \mathrm{co}(\mathcal{F}),$$

where $r_\eta : \mathcal{Z} \to \mathbb{R}$ does not depend on $f, f^*$, and, for any $f^* \in \mathrm{co}(\mathcal{F})$, $\ell^{(\eta)}(z, f^*, f^*) = 0$
and $\ell^{(\eta)}(z, f, f^*)$ is 1-exponentially concave as a function of $f \in \mathrm{co}(\mathcal{F})$ (i.e., (25) holds
with $\eta \ell_f(z) = \ell^{(\eta)}(z, f, f^*)$). Note that the choice of $\ell^{(\eta)}$ and $r_\eta$ in general depends on $\eta$.
Juditsky et al. (2008) show that, under this condition, fast rates can be obtained in, for
example, regression problems with a finite number of regression functions, where the rate
depends on how $\epsilon_\eta := \sup_{P \in \mathcal{P}} \mathbb{E}_{Z \sim P}[r_\eta(Z)]$ varies with $\eta$.

8. The assumption is stated in basic form in their Theorem 4.1; their $Q_2$ is our $\ell^{(\eta)}$ and their $R$ is our $r_\eta$;
the dependence of $r_\eta$ on $\eta$ (their $1/\beta$) is made explicit in their Corollary 5.1.

We now connect the JRT-I assumption to $v$-stochastic exp-concavity. Consider again
the substitution function $\psi(\Pi) := \mathbb{E}_{g \sim \Pi}[g]$ as in Definition 4.7. Letting $\bar{g} = \psi(\Pi)$, the JRT-I
assumption implies that

E 


4..„b,(Z) + bog^E 


= E 

Zr^P 


-log E 
V g~n 


< E 

Zr^P 

(a) 

< E 

Zr^P 


-log E 

V 3~n 


- log 
V 


— E [r^(Z)] < £^, 


where (a) follows by the exp-concavity of $\ell^{(\eta)}$. The derivation shows that, if the JRT-I
condition holds for each $\eta$ with function $r_\eta(z)$ then we have $\eta$-stochastic exp-concavity up
to $\epsilon_\eta := \sup_{P \in \mathcal{P}} \mathbb{E}_{Z \sim P}[r_\eta(Z)]$. In their Theorem 4.1 they go on to show that, for finite $\mathcal{F}$,
by applying the aggregating algorithm at learning rate $\eta$ and an on-line to batch conversion,
one can obtain rates of order $O(\log |\mathcal{F}|/(n\eta) + \epsilon_\eta)$, for each $\eta$. They go on to calculate $\epsilon_\eta$
as a function of $\eta$ in various examples (regression, classification with surrogate loss functions,
density estimation) and, in each example, optimize $\eta$ as a function of $n$ so as to minimize the
rate. Now for each function $\epsilon_\eta$ in their examples, there is a corresponding inverse function
$v$ that maps $\epsilon$ to $\eta$ rather than vice versa, so that if the JRT-I condition holds for $\epsilon_\eta$, then
$v$-stochastic exp-concavity holds. Rather than formalizing this in general, we illustrate it
informally using their regression example (Juditsky et al., 2008, Section 5.1):

Example 5.10 (JRT-I Condition and Regression) JRT consider a regression problem
in which $\mathcal{F}$ is finite and $\sup_{P \in \mathcal{P}} \|f\|_{P,\infty} < \infty$ for all $f \in \mathcal{F}$, where $\|\cdot\|_{P,\infty}$ denotes the
$L_\infty(P_X)$-norm. They further assume that a weak moment assumption holds: for all $P \in \mathcal{P}$,
$\mathbb{E}_{(X,Y) \sim P}[|Y|^s] < \infty$ for some $s \ge 2$. They show that in this setting there exist constants
$c_1, c_2, c_3, c_4 > 0$ such that for all $y \in \mathbb{R}$,
$r_\eta(y) \le c_1 |y| \cdot [\![|y| \ge c_2/\eta]\!] + c_3 \eta y^2 \cdot [\![|y| \ge c_4/\sqrt{\eta}]\!]$.
Bounding expectations of the form $|y|^a \cdot [\![|y| \ge b]\!]$ in the same way as one bounds expectations
of indicator variables $[\![|y| \ge b]\!]$ in the proof of Markov's inequality, this gives that

$$\epsilon_\eta = O\bigl(\eta^{s/2}\bigr),$$

which is strictly increasing in $\eta$. Thus, the inverse $v(\epsilon)$ of $\epsilon_\eta$ is well-defined on $\epsilon > 0$ and
satisfies $v(\epsilon) = O(\epsilon^{2/s})$. Since the JRT-I condition implies that, for all $\eta > 0$, we have
$\eta$-stochastic exp-concavity up to $\epsilon$ if $\epsilon \ge \epsilon_\eta$, it follows that for all $\epsilon > 0$, we must have
$\eta$-stochastic exp-concavity up to $\epsilon$ for $\eta \le v(\epsilon)$. It follows that $v$-stochastic exp-concavity
holds with $v(\epsilon) = O(\epsilon^{2/s})$. In this unbounded loss case, we can easily obtain a rate by
using the aggregating algorithm with online-to-batch conversion. Applying Proposition 4.5
with the optimal choice of $\epsilon$ yields a rate of $O\bigl((\log |\mathcal{F}| / n)^{s/(s+2)}\bigr)$, which coincides with the
rate obtained by Juditsky, Rigollet, and Tsybakov (2008) in their Corollary 5.2 and the
minimax rate for this problem (Audibert, 2009). ■
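
The trade-off behind the rate in Example 5.10 can be illustrated by minimizing a bound of the form $\log|\mathcal{F}|/(n\eta) + C\eta^{s/2}$ over $\eta$. The sketch below makes several assumptions: $\epsilon_\eta$ is replaced by the exact power $C\eta^{s/2}$, and the constant $C$, the moment exponent $s$ and the model size are hypothetical. Setting the derivative to zero gives the closed-form minimizer used in `optimal_eta`.

```python
import math

def jrt_bound(eta, n, log_card, C, s):
    """Bound log|F| / (n eta) + C eta^(s/2), with eps_eta modeled as C eta^(s/2)."""
    return log_card / (n * eta) + C * eta ** (s / 2.0)

def optimal_eta(n, log_card, C, s):
    """Stationary point of the bound: eta* = (2 log|F| / (n C s))^(2 / (s + 2))."""
    return (2.0 * log_card / (n * C * s)) ** (2.0 / (s + 2.0))

s, C, log_card = 4.0, 1.0, math.log(50)      # hypothetical constants, |F| = 50

rates = {}
for n in (1_000, 16_000):
    eta_star = optimal_eta(n, log_card, C, s)
    rates[n] = jrt_bound(eta_star, n, log_card, C, s)
    print(n, eta_star, rates[n])

# Multiplying n by 16 should divide the optimized bound by 16^(s / (s + 2)).
ratio = rates[1_000] / rates[16_000]
print(ratio, 16 ** (s / (s + 2.0)))
```

Because both terms of the bound are convex in $\eta$ for $s \ge 2$, the stationary point is the global minimizer, and the optimized bound scales as $(\log|\mathcal{F}|/n)^{s/(s+2)}$.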
5.4 The $v$-Central Condition and Existence of Unique Risk-Minimizers

Corollary 3.11 showed that, under Assumption A, strong $\eta$-fast rate (i.e. central and PPC)
conditions imply uniqueness of optimal $f^*$'s. Here we extend this result, for bounded loss,
to the $v$-fast rate conditions, and also provide a converse, thus completely characterizing
uniqueness of $f^*$ in terms of the $v$-central condition, for bounded losses. To understand
the proposition, note that for two predictors with the same risk, $R(P, f) = R(P, f^*)$, it
holds that $f$ and $f^*$ achieve the same loss almost surely, so they essentially coincide, if and
only if $\mathrm{Var}_{Z \sim P}[\ell_f(Z) - \ell_{f^*}(Z)] = 0$. In the proposition we use $\mathcal{F}_\epsilon = \{f^*\} \cup \{f \in \mathcal{F} :
\mathrm{Var}_{Z \sim P}[\ell_f(Z) - \ell_{f^*}(Z)] \ge \epsilon\}$ to denote the subset of $\mathcal{F}$ where all $f$'s that are very similar
to, but not identical with, $f^*$ have been taken out.

Proposition 5.11 ($v$-central condition and (non-)uniqueness of risk minimizers)
Fix $(\ell, \{P\}, \mathcal{F})$ such that the loss $\ell$ is bounded and Assumption A holds, and let $f^*$ be an
$\mathcal{F}$-risk minimizer for $P$. Exactly one of the following two situations is the case:

1. The $v$-central condition holds for some $v$ that is sublinear at 0, i.e. $\lim_{x \downarrow 0} v(x)/x = \infty$.
In this case, $f^*$ is essentially unique, in the sense that for every sequence $f_1, f_2, \ldots \in \mathcal{F}$
such that $\mathbb{E}_{Z \sim P}[\ell_{f_j}(Z)] \to \mathbb{E}_{Z \sim P}[\ell_{f^*}(Z)]$, we have $\mathrm{Var}_{Z \sim P}[\ell_{f_j}(Z) - \ell_{f^*}(Z)] \to 0$.
Moreover, for every $\epsilon > 0$, $(\ell, \{P\}, \mathcal{F}_\epsilon)$ satisfies the $\eta$-central condition for some
$\eta > 0$.

2. The $v$-central condition only holds trivially in the sense of Example 5.5, i.e. it does
not hold for any $v$ with $\lim_{x \downarrow 0} v(x)/x = \infty$. In this case, $f^*$ is essentially non-unique,
in the sense that there exists $\epsilon > 0$ and a sequence $f_1, f_2, \ldots \in \mathcal{F}$ (possibly identical
for all large $j$) such that $\mathbb{E}_{Z \sim P}[\ell_{f_j}(Z)] \to \mathbb{E}_{Z \sim P}[\ell_{f^*}(Z)]$, but, for all sufficiently large
$j$, $\mathrm{Var}_{Z \sim P}[\ell_{f_j}(Z) - \ell_{f^*}(Z)] \ge \epsilon$. Moreover, for some $\epsilon > 0$, $(\ell, \{P\}, \mathcal{F}_\epsilon)$ does not
satisfy the $\eta$-central condition for any $\eta > 0$.

Proof For Part 1, Proposition 3.9 implies that the $v$-PPC condition holds. Now Part 2 of
Theorem 5.4 implies that the $u$-Bernstein condition holds with $u$ such that $\lim_{x \downarrow 0} u(x) =
\lim_{x \downarrow 0} x/v(x) = 0$ by assumption. Then it follows from the definition of the $u$-Bernstein
condition that $f^*$ is essentially unique. Moreover, by Part 1(b) of Theorem 5.4, there exists a
function $v'$ with $v'(x) > 0$ for $x > 0$, such that for every $\delta > 0$, $(\ell, \{P\}, \{f^*\} \cup \mathcal{G})$ satisfies the
$\eta$-central condition with $\eta = v'(\delta) > 0$ for any subset $\mathcal{G} \subseteq \{f \in \mathcal{F} : R(P, f) - R(P, f^*) \ge \delta\}$.
Now since the $u$-Bernstein condition holds with $\lim_{x \downarrow 0} u(x) = 0$, we know that, for every
$\epsilon > 0$, there is a $\delta > 0$ such that $\mathrm{Var}_{Z \sim P}[\ell_f(Z) - \ell_{f^*}(Z)] \ge \epsilon$ implies $R(P, f) - R(P, f^*) \ge \delta$.
For this $\delta$, $\mathcal{G} = \{f \in \mathcal{F} : \mathrm{Var}_{Z \sim P}[\ell_f(Z) - \ell_{f^*}(Z)] \ge \epsilon\}$ is a subset of $\{f \in \mathcal{F} : R(P, f) -
R(P, f^*) \ge \delta\}$, and consequently, as already established, $(\ell, \{P\}, \{f^*\} \cup \mathcal{G})$ must satisfy the
$\eta$-central condition for $\eta > 0$, which is what we had to prove.

For Part 2, to show nonuniqueness of $f^*$, note that by Theorem 5.4, Part 1, the
$u$-Bernstein condition cannot hold for any $u$ with $\lim_{x \downarrow 0} u(x) = 0$. This already shows that
there exists a sequence as required, for some $\epsilon > 0$, so that $f^*$ is essentially non-unique.
Since $\mathrm{Var}_{Z \sim P}[\ell_{f_j}(Z) - \ell_{f^*}(Z)] \ge \epsilon$ for all elements of the sequence and $R(P, f_j) \to R(P, f^*)$,
the first inequality of Lemma 5.6 applied with $X = \eta(\ell_{f_j}(Z) - \ell_{f^*}(Z))$ now gives that, for
all $\eta > 0$, there exists $f_j$ such that $\log \mathbb{E}_{Z \sim P}\bigl[e^{\eta(\ell_{f^*}(Z) - \ell_{f_j}(Z))}\bigr] > 0$, so the $\eta$-central
condition does not hold. ■

6. From Fast Rates for Actions to Fast Rates for Functions

Let $\ell : \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$ be a loss function, where $\mathcal{Y}$ is a set of possible outcomes and $\mathcal{A}$ is a set of
possible actions. Then our abstract formulation in terms of $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ can accommodate
unconditional problems, where distributions $P \in \mathcal{P}$ are on $\mathcal{Z} = \mathcal{Y}$ and both $\mathcal{F}$ and $\mathcal{F}_d$ are
subsets of $\mathcal{A}$; but it can also capture the conditional setting, where we observe additional
features from a covariate space $\mathcal{X}$. In that case, outcomes are pairs $(X, Y)$ from $\mathcal{Z}' =
\mathcal{X} \times \mathcal{Y}$, the model $\mathcal{F}'$ and decision set $\mathcal{F}'_d$ are both sets of functions $\{f : \mathcal{X} \to \mathcal{A}\}$ from
features to actions, and the loss is commonly defined in terms of the unconditional loss as
$\ell'_f(x, y) = \ell(f(x), y)$.

It may often be easier to establish properties like stochastic mixability for the
unconditional setting than for the conditional setting. In this section we therefore consider when
we can lift conditions for unconditional problems with loss $\ell$ to the conditional setting with
loss $\ell'$. For the condition of being $\eta$-stochastically mixable, this is done by Proposition 6.1
below. And, in Example 6.2, it will be seen that, in some cases, this also allows us to obtain
the $\eta$-central condition for the conditional setting.

Proposition 6.1 is based on the construction of a substitution function $\psi' : \Delta(\mathcal{F}') \to \mathcal{F}'_d$
for the conditional setting from the substitution function $\psi : \Delta(\mathcal{F}) \to \mathcal{F}_d$ for the
unconditional setting. This works by applying $\psi$ conditionally on every $x \in \mathcal{X}$: first, note that any
distribution $\Pi$ on functions $f \in \mathcal{F}'$ induces, for every $x \in \mathcal{X}$, a distribution $\Pi_x$ on actions
$\mathcal{A}$ by drawing $f \sim \Pi$ and then evaluating $f(x)$. We may therefore define $\psi'(\Pi) = f_\Pi$ with
$f_\Pi$ the function

$$f_\Pi(x) = \psi(\Pi_x). \tag{43}$$

The conditions of the proposition then amount to the requirement that this is a valid
substitution function in the conditional setting.

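Proposition 6.1 Let $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ and $(\ell', \mathcal{P}', \mathcal{F}', \mathcal{F}'_d)$ correspond to the unconditional and
conditional settings described above, and assume all of the following:

• $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ satisfies $\eta$-stochastic mixability up to $\epsilon$ with substitution function $\psi$;

• $P(Y \mid X) \in \mathcal{P}$ for every $P \in \mathcal{P}'$;

• the function $f_\Pi$ from (43) is measurable and contained in $\mathcal{F}'_d$, for every $\Pi \in \Delta(\mathcal{F}')$.

Then $\eta$-stochastic mixability up to $\epsilon$ is satisfied in the conditional setting. In particular, $f_\Pi$
is contained in $\mathcal{F}'_d$ if:

• $\mathcal{F}'_d$ is the set of all measurable functions from $\mathcal{X}$ to $\mathcal{A}$; or

• $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ is $\eta$-stochastically exp-concave up to $\epsilon$, and $\mathcal{F}'_d$ contains the convex hull
of $\mathcal{F}'$. In this case, $(\ell', \mathcal{P}', \mathcal{F}', \mathcal{F}'_d)$ is also $\eta$-stochastically exp-concave up to $\epsilon$.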

We recall from Section 4.2.2 that $\eta$-stochastic exp-concavity is the special case of
$\eta$-stochastic mixability where the substitution function maps $\Pi$ to its mean. In addition,
for $\eta$-stochastic exp-concavity the weak and strong versions of the condition coincide.
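
Proof We verify $\eta$-stochastic mixability up to $\epsilon$ for $(\ell', \mathcal{P}', \mathcal{F}', \mathcal{F}'_d)$ by using
$\eta$-stochastic mixability for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ conditional on each $x \in \mathcal{X}$: for any
$P \in \mathcal{P}'$ and $\Pi \in \Delta(\mathcal{F}')$,

$$\begin{aligned}
E_{P(X,Y)}\big[\ell'_{\psi'(\Pi)}(X,Y)\big]
&= E_{P(X)} E_{P(Y|X)}\big[\ell_{\psi(\Pi_X)}(Y)\big] \\
&\le E_{P(X)} E_{P(Y|X)}\Big[-\frac{1}{\eta}\log E_{\Pi_X(A)}\, e^{-\eta \ell_A(Y)}\Big] + \epsilon \\
&= E_{P(X,Y)}\Big[-\frac{1}{\eta}\log E_{\Pi(f)}\, e^{-\eta \ell'_f(X,Y)}\Big] + \epsilon,
\end{aligned}$$

which was to be shown.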

Verifying that $f_\Pi \in \mathcal{F}'_d$ is trivial if $\mathcal{F}'_d$ is the set of all measurable functions. And if
$(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$ is $\eta$-stochastically exp-concave up to $\epsilon$, then $f_\Pi(x) = E_\Pi[f(x)]$ for all $x$,
and therefore $f_\Pi$ is the mean of $\Pi$ also in the conditional setting. ■


The most important application is when $\mathcal{P}$ contains all possible distributions on $\mathcal{Y}$,
which means that the unconditional problem is classically mixable in the sense of Vovk (see
Section 4.2.1). Then the requirement that $P(Y \mid X) \in \mathcal{P}$ is automatically satisfied.


Example 6.2 (Squared Loss for Misspecified Model) As discussed in Example 4.6,
the squared loss is $\eta$-exp-concave in the unconditional setting on a bounded domain
$\mathcal{F}_d = \mathcal{F} = \mathcal{Z} = [-B, B]$ for $\eta = 1/(4B^2)$. If we make the setting conditional by adding
features, and consider any set of regression functions $\mathcal{F}'$ and any set of joint distributions
$\mathcal{P}'$, then Proposition 6.1 implies that we still have exp-concavity as long as we allow
ourselves to make decisions in the convex hull of $\mathcal{F}'$, i.e. if $\mathcal{F}'_d \supseteq \mathrm{co}(\mathcal{F}')$. Note that this
holds even if the model $\mathcal{F}'$ is misspecified in that it does not contain the true regression
function $x \mapsto E[Y \mid X = x]$. If, furthermore, the model $\mathcal{F}'$ is itself convex and satisfies
Assumption A relative to $\mathcal{P}'$, i.e. the minimum risk $\min_{f \in \mathcal{F}'} E_{(X,Y)\sim P}(Y - f(X))^2$ is
achieved for all $P \in \mathcal{P}'$, then we may take $\mathcal{F}'_d = \mathcal{F}'$ and recover the setting considered by
Lee et al. (1998). Even though this does not require $\mathcal{F}'$ to be well-specified, the strong
version of Assumption B (which implies Assumption A) is then still satisfied, and hence
Proposition 4.12 and Theorem 3.10 tell us that $(\ell', \mathcal{P}', \mathcal{F}')$ satisfies both the strong
$\eta$-pseudoprobability convexity condition and the strong $\eta$-central condition. ■
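
The role of exp-concavity in this example can be checked numerically: under exp-concavity, predicting with the mean of a distribution over actions never incurs more loss than the mix loss of that distribution. The sketch below assumes the half squared loss $\frac{1}{2}(y - a)^2$ on $[-B, B]$, for which $\eta = 1/(4B^2)$ is the standard exp-concavity constant; the scaling of the loss, the actions and the weights are illustrative assumptions, not values from the paper.

```python
import math

def half_squared_loss(action, y):
    """Half squared loss 0.5 * (y - action)^2 (the 1/2 scaling is an assumption)."""
    return 0.5 * (y - action) ** 2

def mix_loss(eta, actions, weights, y):
    """Mix loss -(1/eta) * log E_{a ~ Pi}[exp(-eta * loss(a, y))]."""
    total = sum(w * math.exp(-eta * half_squared_loss(a, y))
                for a, w in zip(actions, weights))
    return -math.log(total) / eta

B = 2.0
eta = 1.0 / (4.0 * B ** 2)
actions = [-2.0, 0.5, 2.0]      # hypothetical actions in [-B, B]
weights = [0.2, 0.5, 0.3]       # hypothetical mixture weights
mean_action = sum(w * a for a, w in zip(actions, weights))

# Gap between the mix loss and the loss of the mean action, on a grid of outcomes.
gaps = []
for i in range(41):
    y = -B + i * (2 * B / 40)
    gaps.append(mix_loss(eta, actions, weights, y) - half_squared_loss(mean_action, y))
```

Every entry of `gaps` should be nonnegative, which is Jensen's inequality applied to the concave map $a \mapsto e^{-\eta \ell_a(y)}$.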


The example raises the question whether we cannot directly conclude, under appropriate
conditions, that, if the $\eta$-central condition holds for some unconditional $(\ell, \mathcal{P}, \mathcal{F})$, then it
should also hold for the corresponding conditional $(\ell', \mathcal{P}', \mathcal{F}')$. We can indeed prove a
trivial analogue of Proposition 6.1 for this case, as long as $\mathcal{F}'$ contains all measurable
functions from $\mathcal{X}$ to $\mathcal{Y}$; we implicitly used this result in Example 3.7. Example 6.2, however,
shows that, if one can first establish $\eta$-stochastic exp-concavity for $(\ell, \mathcal{P}, \mathcal{F}, \mathcal{F}_d)$, one can
sometimes reach the stronger conclusion that $(\ell', \mathcal{P}', \mathcal{F}')$ satisfies the $\eta$-central condition as
long as $\mathcal{F}'$ is merely convex, rather than the set of all functions from $\mathcal{X}$ to $\mathcal{Y}$.

7. The Central Condition Implies Fast Rates

In this section, we show how a statistical learning problem’s satisfaction of the strong
$\eta$-central condition implies fast rates of $O(1/n)$ under a bounded losses assumption.
Theorem 7.6 herein establishes via a rather direct argument that the strong $\eta$-central condition
implies an exact oracle inequality (i.e. with leading constant 1) with a fast rate for finite
function classes, and Theorem 7.7 extends this result to VC-type classes. We emphasize
that the implication of fast rates from the strong $\eta$-central condition under a bounded
losses assumption is not itself new. Specifically, for bounded losses, the central condition
is essentially equivalent to the Bernstein condition by Theorem 5.4, and therefore implies
fast rates via existing fast rate results for the Bernstein condition. For instance, for finite
classes Theorem 4.2 of Zhang (2006b) implies a fast $O(1/n)$ rate by letting $\ell_\theta$ be our excess
loss $\ell_f - \ell_{f^*}$, assumed to satisfy the bounded loss condition therein, setting $\alpha = 0$, taking
$\pi$ to be the uniform prior over a finite class $\mathcal{F}$, and taking $\rho$ as … for some sufficiently
small constant $C$. In addition, Audibert (2004) showed fast rates for classification under
the Bernstein condition⁹; see for example Theorem 3.4 of Audibert (2004) along with the
discussion of how the variant of the (CA3) condition needed there is related to the (CA1)
condition connected to VC-classes. However, since we posit the one-sided central condition
rather than the two-sided Bernstein condition as our main condition, it is interesting to
take a direct route based on the central condition itself, rather than proceeding via the
Bernstein condition. As an added benefit, this approach turns out to give better constants
and a better dependence on the upper bound on the loss.

We proceed via the standard Cramér-Chernoff method, which also lies at the heart of
many standard (and advanced) concentration inequalities (Boucheron et al., 2013). This
method requires an upper bound on the cumulant generating function. We solve this
subproblem by solving an optimization problem that is an instance of the general moment
problem, a problem on which Kemperman (1968) has conducted a detailed geometric study.
This strategy leads to a fast rates bound for finite classes, which can be extended to
parametric (VC-type) classes, as shown in Section 7.3.

7.1 The Strong Central Condition and ERM 

For the remainder of Section 7, we will consider the conditional setting, where the loss
$\ell_f(Z)$ takes values in the bounded range $[0, V]$ for outcomes $Z = (X, Y) \in \mathcal{X} \times \mathcal{Y}$ and
functions $f$ from $\mathcal{F} = \{f: \mathcal{X} \to \mathcal{A}\}$. We take $\mathcal{P} = \{P\}$ to be a single fixed distribution and
we will assume throughout that $(\ell, \{P\}, \mathcal{F})$ satisfies the strong $\eta$-central condition for some
$\eta > 0$. That is, there exists $f^* \in \mathcal{F}$ such that

$$\log E_{Z \sim P} \exp(-\eta W_f) \le 0 \quad \text{for all } f \in \mathcal{F}, \qquad (44)$$

where we have abbreviated the excess loss by $W_f(Z) = \ell_f(Z) - \ell_{f^*}(Z)$; for brevity we further
abbreviate $W_f(Z)$ to $W_f$ in this section. Then, by Jensen’s inequality, $f^*$ is $\mathcal{F}$-optimal for
$P$. We let $\eta^*$ denote the largest $\eta$ for which (44) holds.


9. Audibert actually introduces multiple conditions, referred to as variants of the margin condition, but
these actually are closer to Bernstein-type conditions as they take into account the function class $\mathcal{F}$.

An empirical measure $P_n$ associated with an $n$-sample $\mathbf{z}$, comprising $n$ independent,
identically distributed (iid) observations $(Z_1, \ldots, Z_n) = ((X_1, Y_1), \ldots, (X_n, Y_n))$, operates
on functions as $P_n f = \frac{1}{n}\sum_{j=1}^n f(X_j)$ and on losses as $P_n \ell_f = \frac{1}{n}\sum_{j=1}^n \ell_f(Z_j)$.

Cramér-Chernoff. We will bound the probability that the ERM estimator

$$\hat{f}_{\mathbf{z}} := \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{j=1}^n \ell_f(Z_j) \qquad (45)$$

selects a hypothesis with excess risk $R(P, f) - R(P, f^*) = E[W_f]$ above $\frac{a}{n}$ for some constant
$a > 0$. For any real-valued random variable $X$, let $\eta \mapsto \Lambda_X(\eta) = \log E e^{\eta X}$ denote its
cumulant generating function (CGF), which is known to be convex and satisfies
$\Lambda'_X(0) = E[X]$.

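Lemma 7.1 (Cramér-Chernoff) For any $f \in \mathcal{F}$, $\eta > 0$ and $t \in \mathbb{R}$,

$$\Pr\Big(\frac{1}{n}\sum_{j=1}^n \ell_f(Z_j) \le \frac{1}{n}\sum_{j=1}^n \ell_{f^*}(Z_j) + t\Big) \le \exp\big(\eta n t + n\Lambda_{-W_f}(\eta)\big). \qquad (46)$$

Proof Applying Markov’s inequality to $e^{-\eta n P_n W_f}$ and using the fact that
$\Lambda_{-n P_n W_f}(\eta) = n\Lambda_{-W_f}(\eta)$ for iid observations, yields

$$\Pr(-P_n W_f \ge -t) \le \exp\big(\eta n t + \Lambda_{-n P_n W_f}(\eta)\big) = \exp\big(\eta n t + n\Lambda_{-W_f}(\eta)\big),$$

from which the lemma follows. ■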
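
As a sanity check of the Cramér-Chernoff bound, the following sketch compares the exact probability $\Pr(P_n W \le t)$ with $\exp(\eta n t + n\Lambda_{-W}(\eta))$ for a toy excess loss. The two-point distribution ($W = -1$ with probability $p$, $W = +1$ otherwise) and the helper names are assumptions made for illustration only.

```python
import math

def exact_prob(n, p, t):
    """Pr(P_n W <= t) for iid W with W = -1 w.p. p and W = +1 w.p. 1 - p."""
    total = 0.0
    for k in range(n + 1):              # k = number of draws equal to -1
        empirical_mean = (n - 2 * k) / n
        if empirical_mean <= t:
            total += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    return total

def chernoff_bound(n, p, t, eta):
    """exp(eta * n * t + n * Lambda_{-W}(eta)), with Lambda_{-W}(eta) = log E exp(-eta W)."""
    cgf = math.log(p * math.exp(eta) + (1 - p) * math.exp(-eta))
    return math.exp(eta * n * t + n * cgf)

n, p = 50, 0.3                           # hypothetical sample size and distribution
exact = exact_prob(n, p, 0.0)            # probability that the empirical excess loss is <= 0
bounds = {eta: chernoff_bound(n, p, 0.0, eta) for eta in (0.1, 0.5, 1.0)}
```

The bound holds for every $\eta > 0$; how tight it is depends on $\eta$, which is why the remainder of the section is concerned with controlling the CGF at a well-chosen value of $\eta$.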


7.2 Semi-infinite Linear Programming and the General Moment Problem 

We first consider the canonical case that $W_f$ takes values in $[-1, 1]$ (i.e., $V = 1$), that
$\Lambda_{-W_f}(\eta^*) = 0$ with equality (as opposed to the inequality in Equation 44) and that
$E[W_f] = a/n$ for some constant $a > 0$ that does not depend on $f$. These restrictions allow us
to formulate the goal of bounding the CGF as an instance of the general moment problem of
Kemperman (1968, 1987). We will later relax them to allow general $V$, $\Lambda_{-W_f}(\eta^*) \le 0$ and
$E[W_f] \ge a/n$.

As illustrated by Figure 3, our approach will be to bound $\Lambda_{-W_f}(\eta)$ at $\eta = \eta^*/2$ from
above by maximizing over all possible random variables $W_f$ subject to the given constraints.
This is equivalent to minimizing $-E[\exp((\eta^*/2)S)]$ over $S = -W_f$ and may be formulated
as an instance of the general moment problem, which we describe next.

The general moment problem. Let $\Delta(\mathcal{S})$ be the set of all probability measures over a
measurable space $\mathcal{S}$. Then for any real-valued measurable functions $h, g_1, \ldots, g_m$ on $\mathcal{S}$ and
constants $k_1, \ldots, k_m$, the general moment problem is the semi-infinite linear program

$$\inf_{P \in \Delta(\mathcal{S})} E_{S \sim P}\, h(S) \quad \text{subject to} \quad E_{S \sim P}\, g_j(S) = k_j, \quad j = 1, \ldots, m. \qquad (47)$$

Figure 3: Control of the CGF of $-W_f$ for a function $f$ with excess loss $E[W_f]$ of order $1/n$.
The derivative at 0 equals $-E[W_f]$.

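Define the vector-valued map $g: \mathcal{S} \to \mathbb{R}^m$ as $g(s) = (g_1(s), \ldots, g_m(s))$ and the vector
$k = (k_1, \ldots, k_m)$. Then Theorem 3 of Kemperman (1968), which was also shown independently
by Richter (1957) and Karlin and Studden (1966), states that, if $k \in \operatorname{int} \operatorname{co}(g(\mathcal{S}))$, the
optimal value of problem (47) equals

$$\sup\Big\{ d_0 + \sum_{j=1}^m d_j k_j : d^* = (d_0, d_1, \ldots, d_m) \in D^* \Big\}, \qquad (48)$$

where $D^* \subset \mathbb{R}^{m+1}$ is the set

$$D^* := \Big\{ d^* = (d_0, d_1, \ldots, d_m) \in \mathbb{R}^{m+1} : h(s) \ge d_0 + \sum_{j=1}^m d_j g_j(s) \text{ for all } s \in \mathcal{S} \Big\}. \qquad (49)$$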

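Instantiating, we choose $\mathcal{S} = [-1, 1]$ and define

$$h(s) = -e^{(\eta^*/2)s}, \quad g_1(s) = s, \quad g_2(s) = e^{\eta^* s}, \quad k_1 = -\frac{a}{n}, \quad k_2 = 1,$$

which yields the following special case of problem (47):

$$\inf_{P \in \Delta([-1,1])} \; -E_{S \sim P}\, e^{(\eta^*/2)S} \qquad (50a)$$

$$\text{subject to} \quad E_{S \sim P}\, S = -\frac{a}{n} \qquad (50b)$$

$$E_{S \sim P}\, e^{\eta^* S} = 1. \qquad (50c)$$

Equation 48 from the general moment problem now instantiates to

$$\sup\Big\{ d_0 - \frac{a}{n} d_1 + d_2 : d^* = (d_0, d_1, d_2) \in D^* \Big\}, \qquad (51)$$

with $D^*$ equal to the set

$$\Big\{ d^* = (d_0, d_1, d_2) \in \mathbb{R}^3 : -e^{(\eta^*/2)s} \ge d_0 + d_1 s + d_2 e^{\eta^* s} \text{ for all } s \in [-1, 1] \Big\}. \qquad (52)$$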

Applying Theorem 3 of Kemperman (1968) requires $k \in \operatorname{int}\operatorname{co} g([-1, 1])$. We first
characterize when $k \in \operatorname{co} g([-1, 1])$ holds and handle the $\operatorname{int}\operatorname{co} g([-1, 1])$ version after
Theorem 7.3. The proof of the next result, along with all subsequent results in this section,
can be found in Appendix A.4.

Lemma 7.2 For $a > 0$, the point $k = (-\frac{a}{n}, 1) \in \operatorname{co}(g([-1, 1]))$ if and only if

$$\frac{a}{n} \le \frac{e^{\eta^*} + e^{-\eta^*} - 2}{e^{\eta^*} - e^{-\eta^*}} = \frac{\cosh(\eta^*) - 1}{\sinh(\eta^*)}. \qquad (53)$$

Moreover, $k \in \operatorname{int}\operatorname{co}(g([-1, 1]))$ if and only if the inequality in (53) is strict.

Note that (53) is guaranteed to hold, because otherwise the semi-infinite linear program
(50) is infeasible (which in turn implies that such an excess loss random variable cannot
exist).
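
The feasibility threshold in (53) is easy to evaluate. The sketch below computes it and checks two things that are not stated in the paper but follow from standard identities: the two expressions in (53) agree, and both equal $\tanh(\eta^*/2)$. The function names are hypothetical.

```python
import math

def max_mean_excess_loss(eta_star):
    """Right-hand side of (53): (cosh(eta*) - 1) / sinh(eta*)."""
    return (math.cosh(eta_star) - 1.0) / math.sinh(eta_star)

def exponential_form(eta_star):
    """Same threshold written with exponentials, as in the middle expression of (53)."""
    e_pos, e_neg = math.exp(eta_star), math.exp(-eta_star)
    return (e_pos + e_neg - 2.0) / (e_pos - e_neg)

def is_feasible(a, n, eta_star):
    """True if a mean excess loss of a/n is compatible with Lambda_{-W_f}(eta*) = 0 on [-1, 1]."""
    return a / n <= max_mean_excess_loss(eta_star)

threshold = max_mean_excess_loss(1.0)   # roughly 0.46 for eta* = 1
```

Since the threshold does not depend on $n$, the constraint is only binding for small samples: once $a/n$ falls below $\tanh(\eta^*/2)$ the moment problem is feasible.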

The next theorem is a key result for using the strong central condition to control the 
CGF. 


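Theorem 7.3 Let $f$ be an element of $\mathcal{F}$ with $(\ell_f - \ell_{f^*})(Z)$ taking values in $[-1, 1]$,
$n \in \mathbb{N}$, $E_{Z \sim P}(\ell_f - \ell_{f^*})(Z) = \frac{a}{n}$ for some $a > 0$, and $\Lambda_{-(\ell_f - \ell_{f^*})(Z)}(\eta^*) = 0$ for some
$\eta^* > 0$. If

$$\frac{a}{n} < \frac{\cosh(\eta^*) - 1}{\sinh(\eta^*)}, \qquad (54)$$

then

$$E_{Z \sim P}\, e^{-(\eta^*/2)(\ell_f - \ell_{f^*})(Z)} \le 1 - \frac{0.21(\eta^* \wedge 1)a}{n}.$$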

Corollary 7.4 The result of Theorem 7.3 also holds when the strict inequality in (54) is
replaced with inequality, i.e. $\frac{a}{n} \le \frac{\cosh(\eta^*) - 1}{\sinh(\eta^*)}$.
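
The boundary case of Corollary 7.4 can be examined in closed form for a two-point excess loss. The sketch below takes $W = -1$ with probability $q = 1/(1 + e^{\eta^*})$ and $W = +1$ otherwise, which gives $E[e^{-\eta^* W}] = 1$ and $E[W] = \tanh(\eta^*/2)$, and then checks the inequality $E[e^{-(\eta^*/2) W}] \le 1 - 0.21(\eta^* \wedge 1)E[W]$. Both the choice of distribution and the exact form of the left-hand side are assumptions made for this illustration.

```python
import math

def boundary_case(eta_star):
    """Two-point excess loss on {-1, +1} that satisfies the central condition with equality."""
    q = 1.0 / (1.0 + math.exp(eta_star))           # Pr(W = -1)
    mean_w = (1.0 - q) - q                          # E[W], equals tanh(eta*/2)
    mgf_full = q * math.exp(eta_star) + (1.0 - q) * math.exp(-eta_star)
    mgf_half = q * math.exp(eta_star / 2) + (1.0 - q) * math.exp(-eta_star / 2)
    return mean_w, mgf_full, mgf_half

def cgf_bound(eta_star, mean_w):
    """Assumed bound 1 - 0.21 * (eta* ^ 1) * E[W] on E[exp(-(eta*/2) W)]."""
    return 1.0 - 0.21 * min(eta_star, 1.0) * mean_w

results = {}
for eta_star in (0.1, 0.5, 1.0, 2.0, 5.0):
    mean_w, mgf_full, mgf_half = boundary_case(eta_star)
    results[eta_star] = (mean_w, mgf_full, mgf_half, cgf_bound(eta_star, mean_w))
```

For this distribution $E[e^{-(\eta^*/2)W}] = 1/\cosh(\eta^*/2)$, so the check reduces to a one-dimensional inequality in $\eta^*$; it is a spot check on one feasible distribution, not a substitute for the moment-problem argument.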

We now present an extension of this result for losses with range $[0, V]$.

Corollary 7.5 Let $g_1(x) = x$ and $y_2 = 1$ be common settings for the following two problems.
The instantiation of problem (47) with $\mathcal{S} = [-V, V]$, $h(x) = -e^{(\eta/2)x}$, $g_2(x) = e^{\eta x}$, and
$y_1 = -\frac{a}{n}$ has the same optimal value as the instantiation of problem (47) with
$\mathcal{S} = [-1, 1]$, $h(x) = -e^{(V\eta/2)x}$, $g_2(x) = e^{(V\eta)x}$, and $y_1 = -\frac{a}{Vn}$.

7.3 Fast Rates 

We now show how the above results can be used to obtain an exact oracle inequality with 
a fast rate. We first present a result for finite classes and then present a result for VC-type 
classes (classes with logarithmic universal metric entropy). 


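Theorem 7.6 Let $(\ell, P, \mathcal{F})$ satisfy the strong $\eta^*$-central condition, where $|\mathcal{F}| = N$, $\ell$ is a
nonnegative loss, and $\sup_{f \in \mathcal{F}} \ell_f(Z) \le V$ a.s. for a constant $V$. Then for all $n \ge 1$, with
probability at least $1 - \delta$,

$$E_{Z \sim P}\big[\ell_{\hat{f}_{\mathbf{z}}}(Z)\big] \le E_{Z \sim P}\big[\ell_{f^*}(Z)\big] + \frac{5\max\{V, \frac{1}{\eta^*}\}\big(\log\frac{1}{\delta} + \log N\big)}{n}.$$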
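
To get a feel for the finite-class bound, the excess-risk term $5\max\{V, 1/\eta^*\}(\log\frac{1}{\delta} + \log N)/n$ can be evaluated directly. The helper below is a minimal sketch; the function name and the example values are hypothetical.

```python
import math

def finite_class_excess_risk_bound(V, eta_star, N, n, delta):
    """Excess-risk term of Theorem 7.6: 5 * max(V, 1/eta*) * (log(1/delta) + log N) / n."""
    return 5.0 * max(V, 1.0 / eta_star) * (math.log(1.0 / delta) + math.log(N)) / n

# Illustrative values only: loss bounded by 1, eta* = 1, 100 hypotheses, 95% confidence.
bound_1k = finite_class_excess_risk_bound(V=1.0, eta_star=1.0, N=100, n=1000, delta=0.05)
bound_2k = finite_class_excess_risk_bound(V=1.0, eta_star=1.0, N=100, n=2000, delta=0.05)
bound_small_eta = finite_class_excess_risk_bound(V=1.0, eta_star=0.1, N=100, n=1000, delta=0.05)
```

Doubling $n$ halves the bound, which is the $O(1/n)$ behaviour; a small $\eta^*$ enters only through the factor $\max\{V, 1/\eta^*\}$.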

Before presenting the result for VC-type classes, we require some definitions. For a
pseudometric space $(\mathcal{G}, d)$, for any $\epsilon > 0$, let $\mathcal{N}(\epsilon, \mathcal{G}, d)$ be the $\epsilon$-covering number of
$(\mathcal{G}, d)$; that is, $\mathcal{N}(\epsilon, \mathcal{G}, d)$ is the minimal number of balls of radius $\epsilon$ needed to cover $\mathcal{G}$.
We will further constrain the cover (the set of centers of the balls) to be a subset of $\mathcal{G}$
(i.e. to be proper), thus ensuring that the strong central condition assumption transfers to
any (proper) cover of $\mathcal{F}$. Note that the ‘proper’ requirement at most doubles the constant $K$
below, as shown in Lemma 2.1 of Vidyasagar (2002).

We now present the fast rates result for VC-type classes. The proof, which can be found
as the proof of Theorem 7 of Mehta and Williamson (2014), uses Theorem 6 of Mehta
and Williamson (2014) and the proof of Theorem 7.6. Below, we denote the loss-composed
version of a function class $\mathcal{F}$ as $\ell \circ \mathcal{F} := \{\ell_f : f \in \mathcal{F}\}$.

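Theorem 7.7 Let $(\ell, P, \mathcal{F})$ satisfy the strong $\eta^*$-central condition with $\ell \circ \mathcal{F}$ separable,
where, for a constant $K \ge 1$, for each $\epsilon \in (0, K]$ we have
$\mathcal{N}(\ell \circ \mathcal{F}, L_2(P), \epsilon) \le (\frac{K}{\epsilon})^C$; and $\sup_{f \in \mathcal{F}} \ell(Y, f(X)) \le V$ a.s. for a constant
$V \ge 1$. Then for all $n \ge 5$ and $\delta \le \frac{1}{2}$, with probability at least $1 - \delta$,

$$E_{Z \sim P}\big[\ell_{\hat{f}_{\mathbf{z}}}(Z)\big] \le E_{Z \sim P}\big[\ell_{f^*}(Z)\big] + \frac{1}{n}\max\Big\{ 8\max\{V, \tfrac{1}{\eta^*}\}\big(C\log(Kn) + \log\tfrac{2}{\delta}\big),\; 2V\Big(1080\,C\log(2Kn) + 90\sqrt{(\log\tfrac{2}{\delta})\,C\log(2Kn)} + \log\tfrac{2}{\delta}\Big) \Big\} + \frac{1}{n}.$$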

We have shown the fast rate of $O(1/n)$ under the best case of the $v$-central condition,
i.e. when $v$ is constant; however, it also is possible to recover intermediate rates for the case
of general $v$.


Corollary 7.8 Let $(\ell, P, \mathcal{F})$ satisfy the $v$-central condition for a finite class $\mathcal{F}$. Then,
for some constant $c$, for all $n$ satisfying
$v\big(w\big(\frac{5(\log\frac{1}{\delta} + \log N)}{cn}\big)\big) \le \frac{1}{cV}$, we get an intermediate rate of
$w\big(\frac{5(\log\frac{1}{\delta} + \log N)}{cn}\big)$, where $w$ is the inverse of the function $x \mapsto x v(x)$.


Proof From part (2) of Theorem 5.4, the $v$-central condition implies the $u$-Bernstein
condition for $u(x) \asymp x/v(x)$, and from part (1b) of Theorem 5.4, we then have the $\eta$-central
condition for $\eta = cv(\delta)$ for the subclass of functions with excess risk above $\delta$, for some
constant $c$. From here, a simple modification of the proof of Theorem 7.6 yields the desired
result as follows. Let $\epsilon$ correspond to the excess risk threshold above which ERM should
reject all functions with high probability. Then, similar to the proof of Theorem 7.6, we
upper bound the probability of ERM picking a function with excess risk $\epsilon$ or higher:

$$N\exp\big(n\Lambda_{-W_f}(cv(\epsilon)/2)\big) = N\exp\big(n\Lambda_{-W_f/V}(cVv(\epsilon)/2)\big) \le N\exp\Big(-0.21\,n\,(cVv(\epsilon) \wedge 1)\,\frac{\epsilon}{V}\Big).$$

For $\epsilon$ satisfying $v(\epsilon) \le \frac{1}{cV}$, the failure probability $\delta$ is at most $N\exp(-0.21\,c\,n\,\epsilon\,v(\epsilon))$, and
hence by inversion we get the rate $w\big(\frac{5(\log\frac{1}{\delta} + \log N)}{cn}\big)$.
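
The inversion step can be made concrete numerically. The sketch below assumes the rate has the form $w\big(5(\log\frac{1}{\delta} + \log N)/(cn)\big)$ with $w$ the inverse of $x \mapsto x v(x)$, computes $w$ by bisection, and compares it with the closed form for the illustrative choice $v(x) = x^{1-\beta}$ (so that $w(y) = y^{1/(2-\beta)}$). The choice of $v$, the constant $c = 1$ and all function names are assumptions made for the example.

```python
import math

def inverse_of_x_v(v, y, hi=1.0, iters=200):
    """Solve x * v(x) = y on [0, hi] by bisection, assuming x * v(x) is increasing."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid * v(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def intermediate_rate(v, n, N, delta, c=1.0):
    """w(5 * (log(1/delta) + log N) / (c * n)), with w the inverse of x -> x * v(x)."""
    y = 5.0 * (math.log(1.0 / delta) + math.log(N)) / (c * n)
    return inverse_of_x_v(v, y)

beta = 0.5
v = lambda x: x ** (1.0 - beta)          # hypothetical v; beta = 1 would make v constant
n, N, delta = 10000, 10, 0.05
rate = intermediate_rate(v, n, N, delta)
y = 5.0 * (math.log(1.0 / delta) + math.log(N)) / n
closed_form = y ** (1.0 / (2.0 - beta))
```

With $\beta = 1$ the same computation returns a rate proportional to $1/n$, and with $\beta = 0$ a rate proportional to $n^{-1/2}$, which are the two familiar extremes between which the intermediate rates interpolate.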

8. Discussion, Open Problems and Concluding Remarks

In this paper we identified four general conditions for fast and intermediate learning rates.
The two main ones, which subsumed many previously identified conditions, were the central
condition and stochastic mixability. We provided sufficient assumptions under which
the four conditions become equivalent via the implications

$$\eta\text{-central} \Rightarrow \eta\text{-predictor} \Rightarrow \eta\text{-stochastic mixability} \Rightarrow \eta\text{-PPC} \Rightarrow \eta\text{-central}. \qquad (55)$$

In Sections 3 and 4 we considered the versions of these conditions for fixed $\eta > 0$, as given
by Theorem 4.17, Proposition 4.11, Proposition 4.12 and Theorem 3.10, respectively. For
this fixed $\eta > 0$ case, all implications except one hold under surprisingly weak conditions,
in particular allowing for unbounded loss functions. The exception is ‘central ⇒ predictor’
(Theorem 4.17). Although even this result was applicable to some noncompact decision
sets $\mathcal{F}$ with unbounded losses (Example 4.21), it requires tightness and convexity of the set
$\mathcal{P}$; Example 4.18 shows, however, that sometimes the implication holds even though $\mathcal{P}$ is
neither tight nor convex. An important open question is whether Theorem 4.17 still holds
under weaker versions of Assumption C or Assumption D.

Another restriction of Theorem 4.17 is that, via Assumption D, it requires convexity
of the decision set $\mathcal{F}_d$, which fails for the 0/1-loss and its conditional version, the
classification loss. However, we may extend the definition of the 0/1-loss to $\mathcal{F} = [0, 1]$ and
define the resulting randomized 0/1 or absolute loss as $\ell_f(y) := |y - f|$. This can be
interpreted as the 0/1-loss a decision maker expects to incur if she is allowed to randomize
her decision by flipping a coin with bias $f$, a standard concept in PAC-Bayesian approaches
(Audibert, 2004; Catoni, 2007). For the absolute loss, we can consider $\eta$-stochastic
mixability for $\mathcal{F}_d = \operatorname{co}(\mathcal{F}) = [0, 1]$, which is convex; hence, the requirement of convex $\mathcal{F}_d$ in
Theorem 4.17 is not such a concern.

In Section 5 we discussed weakenings of the four conditions to their $v$-versions. Now
for bounded losses, the four implications above still hold under similar conditions as for the
fixed $\eta$-case. Since the first three implications in (55) were proven in an ‘up to $\epsilon$’ form for
all $\epsilon > 0$, it immediately follows that for arbitrary functions $v$, the implications continue
to hold under the same assumptions if the $\eta$-conditions are replaced by the corresponding
$v$-conditions. This does not work for the fourth implication, since Theorem 3.10 is not given
in an ‘up to $\epsilon$’ form (indeed, we conjecture that it does not hold in this form). However, we
can work around this issue by using instead a detour via the Bernstein condition: by using
first part 2 and then part 1 in Theorem 5.4, it follows that the $v$-PPC condition implies the
$v'$-central condition for $v'(\epsilon) \asymp v(\epsilon)$, so the four $v$-conditions still imply each other, under
the same assumptions as before, up to constant factors. However, the Bernstein-detour
works only for bounded losses, and Examples 5.7, 5.8 and 5.10 together indicate that in
general it cannot be made to work and indeed the analogue of (55) for the $v$-conditions
does not hold for unbounded losses: for decision problems with polynomial rather than
exponential tails on the losses, $v$-stochastic mixability and the $v$-PPC condition may hold
whereas the $v$-central condition does not. Thus there is the question whether the central
condition can be weakened such that the four implications for the $v$-versions continue to
hold, under weak conditions, for unbounded losses; we regard this as the main open
question posed by this work. Another issue here is that, if in a decision problem $(\ell, \mathcal{P}, \mathcal{F})$
that satisfies a $v$-condition, we replace $\mathcal{P}$ by its convex closure, then the $v$-condition may
very well be broken, so, once again, a weakening of Assumption D to nonconvex $\mathcal{P}$ seems
required. Finally, it would be of considerable interest if one could show an analogue for
unbounded losses of Proposition 5.11, which connects (for bounded losses) the central
condition to the existence of a unique risk minimizer. Relatedly, it would be desirable
to link this proposition to the results by Mendelson (2008a) who also connects slow rates
with nonunique risk minimizers, and to Koltchinskii (2006) who gives a version of the
Bernstein condition that does hold if nonunique minimizers exist, indicating that our
$\eta$-central condition (which via Proposition 3.3 implies unique minimizers) might sometimes
be too strong.

Apart from these implications in the ‘main quadrangle’ of Figure 1 on page 6, it would
be good to strengthen some of the other connections shown in that figure, such as the
precise relation between $\eta$-mixability and $\eta$-exp-concavity. It would also be desirable to
establish connections to results in defensive forecasting (Chernov et al., 2010) in which
conditions similar to both the central condition and mixability play a role; their Theorem
9 is reminiscent of the special case of our Theorem 4.17 for the case that $\mathcal{Z}$ is finite and $\mathcal{P}$
consists of all distributions on $\mathcal{Z}$.

We focused on showing equivalence of fast rate conditions and not on showing that one
can actually always obtain fast rates under these conditions. For stochastic mixability, this
immediately follows, under no further conditions, from Proposition 4.5. For the central
condition, the situation is more complicated: in this paper we only showed that it implies
fast rates for bounded loss functions. We know that, for the unbounded log-loss, fast rates
can be obtained under the central condition (and no additional conditions) in a weaker
sense, involving Rényi and squared Hellinger distance (Section 2.2); in work in progress, we
aim at showing that the central condition implies fast rates in the standard sense even for
unbounded loss functions. This does appear possible, up to log-factors; however, it seems
that here one does need weak additional conditions such as existence of certain moments
different from the exponential moment in (4).

Second, by ‘fast’ rates we merely meant rates of order $1/n$; it would of course be highly
desirable to characterize when the rates that are achieved under our conditions by
appropriate algorithms (ERM, Bayes MAP-style and MDL methods for the central condition,
the aggregating algorithm for stochastic mixability) are indeed minimax optimal. Similarly,
one would need examples showing that if a condition fails, then the corresponding fast or
intermediate rates cannot be obtained in general. While several such results are available,
they either focus on showing that, in the worst-case over all $P \in \mathcal{P}$, no learning algorithm,
proper or improper, can achieve a certain rate (in particular Audibert (2009) gives very
general results), or that a particular proper learning algorithm such as ERM cannot achieve
a certain rate (Mendelson, 2008a). Currently unexplored, it seems, are minimax results
where one looks at the optimal (not just ERM) algorithm, but within the restricted class
of all proper learning algorithms.

In the spirit of Vapnik and Chervonenkis, who discovered under what conditions one 
can learn from a finite amount of data at all, we continue our quest for conditions under 
which one can learn from data using not too many examples. 

Appendix A. Additional Proofs
A.1 Proof of Theorem 3.10 in Section 3

Proof We first consider the case that Assumption A holds, and then the case of bounded
loss.

Under Assumption A. Under our Assumption A, we can, for each $P \in \mathcal{P}$, define
$\phi(P) := f^* \in \mathcal{F}$ to be optimal in the sense of (3). Note that $f^*$ depends on $P$, but not on
any $\Pi$. Since we also assume the weak $\eta$-pseudoprobability convexity condition, we must have
that for every $\epsilon > 0$, the $\eta$-pseudoprobability convexity condition holds up to $\epsilon$ for some
function $\phi_\epsilon$. It follows that for all $\epsilon > 0$,
$E_{Z \sim P}[\ell_{f^*}(Z)] \le E_{Z \sim P}[\ell_{\phi_\epsilon(P)}(Z)] \le E_{Z \sim P}[m^\eta_\Pi(Z)] + \epsilon$, so that also

$$E_{Z \sim P}[\ell_{f^*}(Z)] \le E_{Z \sim P}[m^\eta_\Pi(Z)] \qquad (56)$$

for all $\Pi \in \Delta(\mathcal{F})$. Now fix arbitrary $P \in \mathcal{P}$, let $f^* = \phi(P)$ and let $f \in \mathcal{F}$ be arbitrary and
consider the special case that $\Pi = (1 - \lambda)\delta_{f^*} + \lambda\delta_f$ for $\lambda \in [0, \frac{1}{2}]$, where $\delta_f$ is a point-mass
on $f$. Let

$$\chi(\lambda, z) = \eta\, m^\eta_\Pi(z) = -\log\big((1 - \lambda)e^{-\eta \ell_{f^*}(z)} + \lambda e^{-\eta \ell_f(z)}\big)$$

be the corresponding mix loss multiplied by $\eta$, and let

$$\chi(\lambda) = E_{Z \sim P}[\chi(\lambda, Z)] = \eta\, E_{Z \sim P}[m^\eta_\Pi(Z)]$$

be its expected value. Then from (56) it follows that $\chi(\lambda)$ is minimized at $\lambda = 0$, which
implies that the right-derivative $\chi'(0)$ at 0 is nonnegative:

$$\chi'(0) \ge 0. \qquad (57)$$

In order to compute $\chi'(0)$, we first observe that, for any $z$, $\chi(\lambda, z)$ is convex in $\lambda$, because
it is the composition of the negative logarithm with a linear function. Convexity of $\chi(\lambda, z)$
in $\lambda$ implies that the slope $s(d, z) = \frac{\chi(0 + d, z) - \chi(0, z)}{d}$ is non-decreasing in $d \in (0, \frac{1}{2}]$ and
achieves its maximum value at $d = 1/2$, where it never exceeds $2\log 2$:

$$s(1/2, z) = 2\log\frac{e^{-\eta \ell_{f^*}(z)}}{\frac{1}{2}e^{-\eta \ell_{f^*}(z)} + \frac{1}{2}e^{-\eta \ell_f(z)}} \le 2\log 2.$$

Hence $E_{Z \sim P}[s(\frac{1}{2}, Z)] \le 2\log 2 < \infty$ and by the monotone convergence theorem (Shiryaev,
1996)

$$\chi'(0) = \lim_{d \downarrow 0} E_{Z \sim P}[s(d, Z)] = E_{Z \sim P}\Big[\lim_{d \downarrow 0} s(d, Z)\Big] = E_{Z \sim P}\Big[\frac{d}{d\lambda}\chi(\lambda, Z)\Big|_{\lambda = 0}\Big] = 1 - E_{Z \sim P}\Big[\frac{e^{-\eta \ell_f(Z)}}{e^{-\eta \ell_{f^*}(Z)}}\Big]. \qquad (58)$$

Together with (57) and the fact that $\phi(P) = f^*$ and that $P$ was chosen arbitrarily, this
implies the strong $\eta$-central condition as required.
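
The derivative formula (58) can be checked by finite differences on a small discrete example. The sketch below assumes a three-point distribution for $Z$ and arbitrary loss values, which are illustrative choices only; the function names are hypothetical.

```python
import math

def chi(lam, eta, loss_fstar, loss_f, probs):
    """chi(lambda) = E_Z[-log((1 - lam) * exp(-eta * l_{f*}(Z)) + lam * exp(-eta * l_f(Z)))]."""
    return sum(
        -p * math.log((1 - lam) * math.exp(-eta * ls) + lam * math.exp(-eta * lf))
        for p, ls, lf in zip(probs, loss_fstar, loss_f)
    )

def chi_prime_at_zero(eta, loss_fstar, loss_f, probs):
    """Closed form (58): 1 - E_Z[exp(-eta * l_f(Z)) / exp(-eta * l_{f*}(Z))]."""
    return 1.0 - sum(
        p * math.exp(-eta * (lf - ls))
        for p, ls, lf in zip(probs, loss_fstar, loss_f)
    )

# Hypothetical discrete outcome distribution and losses.
probs = [0.5, 0.3, 0.2]
loss_fstar = [0.1, 0.4, 0.9]
loss_f = [0.3, 0.2, 1.5]
eta = 1.0

d = 1e-6
finite_difference = (chi(d, eta, loss_fstar, loss_f, probs)
                     - chi(0.0, eta, loss_fstar, loss_f, probs)) / d
closed_form = chi_prime_at_zero(eta, loss_fstar, loss_f, probs)
```

In this example the derivative is positive, so $E[e^{-\eta(\ell_f - \ell_{f^*})}] \le 1$: the comparator $f^*$ satisfies the central-condition inequality against this particular $f$.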


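When the Loss is Bounded. Let $P \in \mathcal{P}$ be arbitrary. The $\eta$-pseudoprobability convexity
condition implies that for any $\gamma > 0$ we can find $f^* \in \mathcal{F}$ such that

$$E_{Z \sim P}[\ell_{f^*}(Z)] \le E_{Z \sim P}[m^\eta_\Pi(Z)] + \gamma$$

for all distributions $\Pi \in \Delta(\mathcal{F})$. Choose any $f \in \mathcal{F}$ and consider again the special case
$\Pi = (1 - \lambda)\delta_{f^*} + \lambda\delta_f$ for $\lambda \in [0, \frac{1}{2}]$, which gives

$$\chi(0) \le \chi(\lambda) + \eta\gamma \qquad (59)$$

for $\chi(\lambda)$ as above. This time $\chi(0)$ is not necessarily the exact minimum of $\chi(\lambda)$, but (59)
expresses that it is close. To control $\chi'(0)$, we use that

$$\chi(\lambda, z) = \chi(0, z) + \lambda\chi'(0, z) + \tfrac{1}{2}\lambda^2\chi''(\xi, z) \quad \text{for some } \xi \in [0, \lambda]$$

by a second-order Taylor expansion in $\lambda$, which implies that

$$\chi(\lambda) - \chi(0) - \lambda\chi'(0) \le \frac{\lambda^2}{2}\max_{z,\, \xi \in [0, \lambda]} \frac{\big(e^{-\eta \ell_{f^*}(z)} - e^{-\eta \ell_f(z)}\big)^2}{\big((1 - \xi)e^{-\eta \ell_{f^*}(z)} + \xi e^{-\eta \ell_f(z)}\big)^2} \le \frac{\lambda^2}{2}\big(e^{\eta B} - 1\big)^2,$$

where $B$ denotes a bound on $|\ell_f(z) - \ell_{f^*}(z)|$. Together with (59) the choice $\lambda = \sqrt{\gamma}$
(which requires $\gamma \le 1/4$) then allows us to conclude that

$$-\eta\gamma \le \chi(\sqrt{\gamma}) - \chi(0) \le \sqrt{\gamma}\,\chi'(0) + \frac{\gamma}{2}\big(e^{\eta B} - 1\big)^2, \quad \text{so that} \quad \chi'(0) \ge -c\sqrt{\gamma}$$

for $c = \eta + \frac{1}{2}(e^{\eta B} - 1)^2$. Since (58) still holds, taking $\gamma$ small enough that
$1 + c\sqrt{\gamma} \le e^{\eta\epsilon}$ gives us the central condition (12) for any $\epsilon > 0$. ■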

A.2 Proof of Lemma 4.16 in Section 4

Proof Theorem 6.1 of Grünwald and Dawid (2004), itself a direct consequence of a minimax
theorem due to Ferguson (1967), states the following: if a set of distributions $\mathcal{P}$ is convex,
tight and closed in the weak topology, and $L: \mathcal{Z} \times \mathcal{F}_d \to \mathbb{R}$ is a function such that, for all
$f$, $L(z, f)$ is bounded from above and upper semi-continuous in $z$, then

$$\sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}_d} E_{Z \sim P}[L(Z, f)] = \inf_{\rho \in \Delta(\mathcal{F}_d)} \sup_{P \in \mathcal{P}} E_{Z \sim P} E_{f \sim \rho}[L(Z, f)]. \qquad (60)$$

Let n G A(Td) be arbitrary, and observe that /) i® related to ^zj via 

5n(^>/)= E [ezj], 


so we will aim to apply (60) with $L(z, f)$ approximately equal to $\xi_{z,f}$. Although $\xi_{z,f}$ is not
necessarily bounded above, rewriting

$$\xi_{z,f} = e^{\eta \ell_f(z)}\, E_{g \sim \Pi}\big[e^{-\eta \ell_g(z)}\big]$$

we find that it is continuous in $z$, because $\ell_f(z)$ is continuous in $z$ and $E_{g \sim \Pi}[e^{-\eta \ell_g(z)}]$ is also
continuous in $z$ by continuity of $\ell_g(z)$ and the dominated convergence theorem (Shiryaev,
1996), which applies because $e^{-\eta \ell_g(z)} \le 1$. Letting $a \wedge b$ denote the minimum of $a$ and $b$,
it follows that $\xi_{z,f} \wedge b$ is also continuous in $z$ for any number $b$.

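Thus we can apply (60) to the function $L(z, f) = \xi_{z,f} \wedge b$, with $\bar{\mathcal{P}}$ the closure of $\mathcal{P}$ in
the weak topology, to obtain

$$\inf_{\rho \in \Delta(\mathcal{F}_d)} \sup_{P \in \mathcal{P}} E_{Z \sim P} E_{f \sim \rho}[\xi_{Z,f} \wedge b] \le \inf_{\rho \in \Delta(\mathcal{F}_d)} \sup_{P \in \bar{\mathcal{P}}} E_{Z \sim P} E_{f \sim \rho}[\xi_{Z,f} \wedge b] = \sup_{P \in \bar{\mathcal{P}}} \inf_{f \in \mathcal{F}_d} E_{Z \sim P}[\xi_{Z,f} \wedge b]. \qquad (61)$$

We will show that

$$\sup_{P \in \bar{\mathcal{P}}} \inf_{f \in \mathcal{F}_d} E_{Z \sim P}[\xi_{Z,f} \wedge b] \le \sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}_d} E_{Z \sim P}[\xi_{Z,f} \wedge b]. \qquad (62)$$

If $\mathcal{P}$ is closed itself (first possibility in D.4), then $\bar{\mathcal{P}} = \mathcal{P}$ and this is immediate. The second
possibility will be covered at the end of the proof.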

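Together, (61) and (62) imply that

$$\inf_{\rho \in \Delta(\mathcal{F}_d)} \sup_{P \in \mathcal{P}} E_{Z \sim P} E_{f \sim \rho}[\xi_{Z,f} \wedge b] \le \sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}_d} E_{Z \sim P}[\xi_{Z,f} \wedge b] \le \sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}_d} E_{Z \sim P}[\xi_{Z,f}]$$

for any finite $b$. We will show that, for every $\epsilon > 0$, there exists a $b$ such that

$$E_{Z \sim P} E_{f \sim \rho}[\xi_{Z,f} \wedge b] \ge E_{Z \sim P} E_{f \sim \rho}[\xi_{Z,f}] - \epsilon \quad \text{for all } \rho \in \Delta(\mathcal{F}_d) \text{ and } P \in \mathcal{P}. \qquad (63)$$

By letting $\epsilon$ tend to 0, we can therefore conclude that

$$\sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}_d} E_{Z \sim P}[\xi_{Z,f}] \ge \inf_{\rho \in \Delta(\mathcal{F}_d)} \sup_{P \in \mathcal{P}} E_{Z \sim P} E_{f \sim \rho}[\xi_{Z,f}] = \inf_{f \in \mathcal{F}_d} \sup_{P \in \mathcal{P}} E_{Z \sim P}[\xi_{Z,f}], \qquad (64)$$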
where the identity follows from the requirement that $\ell_f(z)$ is convex in $f$, which implies
that $\xi_{z,f}$ is also convex in $f$, and hence the mean of $\rho$ is always at least as good as $\rho$ itself:
$\xi_{z, E_{f \sim \rho}[f]} \le E_{f \sim \rho}[\xi_{z,f}]$. Since the sup inf never exceeds the inf sup, (64) implies (32),
which was to be shown.

To prove (63), we observe that

$$E_{Z \sim P} E_{f \sim \rho}[\xi_{Z,f} \wedge b] \ge E_{Z \sim P} E_{f \sim \rho}\big[\xi_{Z,f}\, \mathbf{1}\{\xi_{Z,f} \le b\}\big] = E_{Z \sim P} E_{f \sim \rho}[\xi_{Z,f}] - E_{Z \sim P} E_{f \sim \rho}\big[\xi_{Z,f}\, \mathbf{1}\{\xi_{Z,f} > b\}\big],$$

and, by uniform integrability, we can take $b$ large enough that
$E_{Z \sim P} E_{f \sim \rho}[\xi_{Z,f}\, \mathbf{1}\{\xi_{Z,f} > b\}] \le \epsilon$ for all $\rho$ and $P$, as required.

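Finally, it remains to establish (62) for the second possibility in Assumption D.4. To
this end, let $\epsilon > 0$ be arbitrary and let $\mathcal{Z}' \subset \mathcal{Z}$ be a compact set such that $P(\mathcal{Z}') \ge 1 - \epsilon$
for all $P \in \mathcal{P}$. In addition, let $\delta > 0$ be small enough that

$$\sup_{z \in \mathcal{Z}'} |\ell_f(z) - \ell_g(z)| < \epsilon \quad \text{for all } f, g \in \mathcal{F}_d \text{ such that } d(f, g) < \delta,$$

which is possible by the assumption of uniform equicontinuity. Since $\mathcal{F}_d$ is totally bounded,
it can be covered by a finite number of balls of radius $\delta$. Let $\ddot{\mathcal{F}}_d \subset \mathcal{F}_d$ be the (finite) set of
centers of those balls. Then we can bound the left-hand side of (62) as follows:

$$\sup_{P \in \bar{\mathcal{P}}} \inf_{f \in \mathcal{F}_d} E_{Z \sim P}[L(Z, f)] \le \sup_{P \in \bar{\mathcal{P}}} \min_{\ddot{f} \in \ddot{\mathcal{F}}_d} E_{Z \sim P}[L(Z, \ddot{f})] = \sup_{P \in \mathcal{P}} \min_{\ddot{f} \in \ddot{\mathcal{F}}_d} E_{Z \sim P}[L(Z, \ddot{f})],$$

where the equality holds by continuity of $E_{Z \sim P}[L(Z, \ddot{f})]$ and hence
$\min_{\ddot{f} \in \ddot{\mathcal{F}}_d} E_{Z \sim P}[L(Z, \ddot{f})]$ in $P$. We now need to relate $\ddot{\mathcal{F}}_d$ back to $\mathcal{F}_d$, which is possible
because, for every $f \in \mathcal{F}_d$, there exists $\ddot{f} \in \ddot{\mathcal{F}}_d$ such that $d(f, \ddot{f}) < \delta$ and hence
$|\ell_{\ddot{f}}(z) - \ell_f(z)| < \epsilon$ for all $z \in \mathcal{Z}'$. It follows that $L(z, \ddot{f}) \le e^{\eta\epsilon} L(z, f)$ for $z \in \mathcal{Z}'$ and
therefore

$$\begin{aligned}
\sup_{P \in \mathcal{P}} \min_{\ddot{f} \in \ddot{\mathcal{F}}_d} E_{Z \sim P}[L(Z, \ddot{f})]
&\le \sup_{P \in \mathcal{P}} \min_{\ddot{f} \in \ddot{\mathcal{F}}_d} E_{Z \sim P}\big[\mathbf{1}\{Z \in \mathcal{Z}'\}\, L(Z, \ddot{f})\big] + \epsilon b \\
&\le e^{\eta\epsilon} \sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}_d} E_{Z \sim P}\big[\mathbf{1}\{Z \in \mathcal{Z}'\}\, L(Z, f)\big] + \epsilon b
\le e^{\eta\epsilon} \sup_{P \in \mathcal{P}} \inf_{f \in \mathcal{F}_d} E_{Z \sim P}[L(Z, f)] + \epsilon b,
\end{aligned}$$

and letting $\epsilon$ tend to 0 we obtain (62), which completes the proof. ■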


A.3 Proof of Theorem 5.4 in Section 5 
Proof We prove the two cases in turn. 

Bernstein ⇒ Central. Fix arbitrary $P \in \mathcal{P}$, and let $f^*$ be $\mathcal{F}$-optimal, i.e. satisfying (3).
In this part of the proof, all expectations $E$ are taken over $Z \sim P$.

Suppose that the $u$-Bernstein condition holds. Fix arbitrary $f \in \mathcal{F}$ and let
$X = \ell_f(Z) - \ell_{f^*}(Z)$. Let $\epsilon \ge 0$ and set $\eta = v(\epsilon) \le c_1\epsilon/u(\epsilon)$. We deal with $\epsilon = 0$ later and
for now focus on the case $\epsilon > 0$, which implies $\eta > 0$. Then Lemma 5.6, applied to the random
variable $\eta X$, gives

$$E[X] + \frac{1}{\eta}\log E[e^{-\eta X}] \le \kappa(2ba)\,\eta \operatorname{Var}(X) \le \kappa(2ba)\,\eta\, u(E[X]) \le \frac{\epsilon}{u(\epsilon)}\, u(E[X]).$$

If $\epsilon \le E[X]$, then the assumption that $u(\epsilon)/\epsilon$ is non-increasing in $\epsilon$ implies that

$$\frac{\epsilon}{u(\epsilon)}\, u(E[X]) \le E[X],$$

and we can conclude that $\frac{1}{\eta}\log E[e^{-\eta X}] \le 0 \le \epsilon$. This inequality establishes (b), and it
establishes (a) for the case $0 < \epsilon \le E[X]$. If $\epsilon > E[X]$, then the assumption that $u$ is
non-decreasing implies that

$$\frac{\epsilon}{u(\epsilon)}\, u(E[X]) \le \epsilon,$$

and, using that $E[X] \ge 0$, we again find that $\frac{1}{\eta}\log E[e^{-\eta X}] \le \epsilon$, as required for (a). To
finish the proof of (a) we now consider $\epsilon = 0$. If we also have $v(0) = 0$ then the central
condition (12) holds trivially for $\epsilon = 0$, so we may assume without loss of generality that
$v(0) > 0$. Then we must have $\eta = v(0) = \liminf_{x \downarrow 0} x/u(x) > 0$. Now fix a decreasing
sequence $\{\epsilon_j\}_{j=1,2,\ldots}$ tending to 0, where the $\epsilon_j$ are all positive and let $\eta_j = v(\epsilon_j)$. By the
argument above, the $\eta_j$-central condition holds up to $\epsilon_j$. This implies (Fact 3.4) that for all
$j$, all $\eta \le \eta_j$, in particular for $\eta = v(0)$, the $\eta$-central condition also holds up to $\epsilon_j$. Thus,
the $\eta$-central condition holds up to $\epsilon$ for all $\epsilon > 0$. By Proposition 3.11 it then follows that
the strong $\eta$-central condition holds, i.e. it also holds for $\epsilon = 0$.


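Pseudoprobability $\Rightarrow$ Bernstein. Suppose that the $v$-PPC condition holds. Fix some $\epsilon > 0$
and let $\eta = v(\epsilon)$. Fix arbitrary $P \in \mathcal{P}$ and let $f^*$ be $\mathcal{F}$-optimal for $P$, achieving (3).
Fix arbitrary $f \in \mathcal{F}$ and let $\Pi$ be the distribution on $\mathcal{F}$ assigning mass 1/2 to $f^*$ and
mass 1/2 to $f$, and let $\underline{f} \in \{f, f^*\}$ be the corresponding random variable. For $z \in \mathcal{Z}$, let
$Y_{z,\underline{f}} = \eta(\ell_{\underline{f}}(z) - \ell_{f^*}(z))$ and let $\epsilon_z = \frac{1}{\eta} \log \mathbb{E}_{\underline{f} \sim \Pi}[e^{-Y_{z,\underline{f}}}]$. Note that $Y_{z,\underline{f}}$ is a random
variable under distribution $\Pi$ (not $P$, since $z$ is fixed), and that

$$\mathbb{E}_{\underline{f} \sim \Pi}[Y_{z,\underline{f}}] = \tfrac{1}{2}\eta\,(\ell_f(z) - \ell_{f^*}(z)). \tag{67}$$

Lemma 5.6 then gives, for each $z \in \mathcal{Z}$,

$$\kappa(-2ab) \operatorname{Var}_{\underline{f} \sim \Pi}[Y_{z,\underline{f}}] \le \mathbb{E}_{\underline{f} \sim \Pi}[Y_{z,\underline{f}}] + \log \mathbb{E}_{\underline{f} \sim \Pi}[e^{-Y_{z,\underline{f}}}] = \tfrac{1}{2}\eta\,(\ell_f(z) - \ell_{f^*}(z)) + \eta\epsilon_z, \tag{68}$$

where we used the definition of $\Pi$ and $\epsilon_z$. We may assume from the definition of the
$v$-pseudoprobability convexity condition that (15) holds for the given $\epsilon$ and $\eta$ and $\Pi$;
rearranging this equation it is seen to be equivalent to $\mathbb{E}_{Z \sim P}[\epsilon_Z] \le \epsilon$. By taking expectations
over $Z$ on both sides of (68) this gives

$$\kappa(-2ab)\, \mathbb{E}_{Z \sim P} \operatorname{Var}_{\underline{f} \sim \Pi}[Y_{Z,\underline{f}}] \le \tfrac{1}{2}\eta\, \mathbb{E}_{Z \sim P}[\ell_f(Z) - \ell_{f^*}(Z)] + \eta\epsilon. \tag{69}$$

The $\Pi$-variance on the left can be rewritten, using (67), as

$$\operatorname{Var}_{\underline{f} \sim \Pi}[Y_{z,\underline{f}}] = \tfrac{1}{2}\big(\eta(\ell_f(z) - \ell_{f^*}(z)) - \mathbb{E}[Y_{z,\underline{f}}]\big)^2 + \tfrac{1}{2}\big(\eta \cdot 0 - \mathbb{E}[Y_{z,\underline{f}}]\big)^2$$
$$= \tfrac{1}{2}\big(\tfrac{1}{2}\eta(\ell_f(z) - \ell_{f^*}(z))\big)^2 + \tfrac{1}{2}\big(\tfrac{1}{2}\eta(\ell_f(z) - \ell_{f^*}(z))\big)^2 = \tfrac{1}{4}\eta^2 (\ell_f(z) - \ell_{f^*}(z))^2.$$

Plugging this into (69) and dividing both sides by $\eta^2 \kappa(-2ab)/4$ gives

$$\mathbb{E}_{Z \sim P}(\ell_f(Z) - \ell_{f^*}(Z))^2 \le \frac{2}{\eta\,\kappa(-2ab)} \Big( \mathbb{E}_{Z \sim P}[\ell_f(Z) - \ell_{f^*}(Z)] + 2\epsilon \Big). \tag{70}$$

This holds for all $\epsilon > 0$ and $\eta = v(\epsilon)$, as long as $\eta = v(\epsilon) > 0$ (if $\eta = 0$ we cannot divide by $\eta^2$
to go from (69) to (70)). Thus, we may set $\epsilon = \mathbb{E}_{Z \sim P}[\ell_f(Z) - \ell_{f^*}(Z)] \ge 0$; if $\eta = v(\epsilon) > 0$
then (70) must hold for $\epsilon$. With these values the right-hand side becomes $6\eta^{-1}\kappa^{-1}(-2ab)\epsilon =
c_2\,\epsilon/v(\epsilon) = u(\epsilon)$, and the result follows by our choice of $\epsilon$. It remains to deal with the case
$\eta = 0$, which by definition of $v$ can only happen if $\epsilon = \mathbb{E}_{Z \sim P}[\ell_f(Z) - \ell_{f^*}(Z)] = 0$. In this
case, (70) still holds for all values of $\epsilon > 0$. We thus infer that the left-hand side of (70) is
bounded by $\inf_{\epsilon > 0} 4\epsilon/(\kappa(-2ab)\,v(\epsilon))$, and the result follows by our definition of $0/v(0)$. ■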
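The variance step in this proof can be checked numerically. The sketch below is only an illustration: it assumes the two-point random variable takes the value $\eta(\ell_f(z) - \ell_{f^*}(z))$ under $f$ and $0$ under $f^*$, each with mass 1/2, and the function name and the numbers for $\eta$ and the two losses are hypothetical.

```python
def two_point_moments(eta, loss_f, loss_fstar):
    """Mean and variance of Y under Pi, where Pi puts mass 1/2 on f and on f*.

    Assumption: Y = eta * (loss - loss_fstar), so Y is 0 when f* is drawn.
    """
    values = [eta * (loss_f - loss_fstar), 0.0]
    mean = sum(values) / 2.0
    var = sum((v - mean) ** 2 for v in values) / 2.0
    return mean, var


# Hypothetical numbers, chosen only for illustration.
eta, loss_f, loss_fstar = 0.7, 0.9, 0.25
mean, var = two_point_moments(eta, loss_f, loss_fstar)

# Closed forms: the mean as in (67), the variance as in the display before (70).
closed_form_mean = 0.5 * eta * (loss_f - loss_fstar)
closed_form_var = 0.25 * eta ** 2 * (loss_f - loss_fstar) ** 2
```

Under these assumptions the directly computed moments agree with the closed forms up to floating-point error.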


A.4 Proofs for Section 7 

Lemma A.1 (Hyper-Concentrated Excess Losses) Let $Z$ be a random variable with
probability measure $P$ supported on $[-V, V]$. Suppose that $\lim_{\eta \to \infty} \mathbb{E}[\exp(-\eta Z)] < 1$ and
$\mathbb{E}[Z] = \mu > 0$. Then there is a suitable modification $Z'$ of $Z$ for which $Z' \le Z$ with
probability 1, the mean of $Z'$ is arbitrarily close to $\mu$, and $\mathbb{E}[\exp(-\eta Z')] = 1$ for arbitrarily
large $\eta$.


Proof First, observe that $Z \ge 0$ a.s. If not, then there must be some finite $\eta > 0$ for
which $\mathbb{E}[\exp(-\eta Z)] = 1$. Now, consider a random variable $Z'$ with probability measure $Q_\epsilon$,
a modification of $Z$ (with probability measure $P$) constructed in the following way. Define
$A := [\mu, V]$ and $A^- := [-V, -\mu]$. Then for any $\epsilon > 0$ we define $Q_\epsilon$ as

$$dQ_\epsilon(z) = \begin{cases} (1 - \epsilon)\, dP(z) & \text{if } z \in A \\ \epsilon\, dP(-z) & \text{if } z \in A^- \\ dP(z) & \text{otherwise.} \end{cases}$$

Additionally, we couple $P$ and $Q_\epsilon$ such that the couple $(Z, Z')$ is a coupling of $(P, Q_\epsilon)$
satisfying

$$\mathbb{E}_{(Z,Z') \sim (P, Q_\epsilon)} \mathbf{1}\{Z \ne Z'\} = \min_{(P', Q'_\epsilon)} \mathbb{E}_{(Z,Z') \sim (P', Q'_\epsilon)} \mathbf{1}\{Z \ne Z'\},$$

where the min is over all couplings of $P$ and $Q_\epsilon$. This coupling ensures that $Z' \le Z$ with
probability 1; i.e. $Z'$ is dominated by $Z$.
Now,

$$\mathbb{E}[\exp(-\eta Z')] = \int_{-V}^{V} e^{-\eta z}\, dQ_\epsilon(z)$$
$$= \int_{A^-} e^{-\eta z}\, dQ_\epsilon(z) + \int_{A} e^{-\eta z}\, dQ_\epsilon(z) + \int_{[0,V] \setminus A} e^{-\eta z}\, dQ_\epsilon(z)$$
$$= \epsilon \int_{A^-} e^{-\eta z}\, dP(-z) + (1 - \epsilon) \int_{A} e^{-\eta z}\, dP(z) + \int_{[0,V] \setminus A} e^{-\eta z}\, dP(z)$$
$$= \epsilon \int_{A} e^{\eta z}\, dP(z) + (1 - \epsilon) \int_{A} e^{-\eta z}\, dP(z) + \int_{[0,V] \setminus A} e^{-\eta z}\, dP(z)$$
$$\ge \epsilon\, e^{\eta \mu} P(A) + (1 - \epsilon) \int_{A} e^{-\eta z}\, dP(z) + \int_{[0,V] \setminus A} e^{-\eta z}\, dP(z). \tag{71}$$

Now, on the one hand, for any $\eta > 0$, the sum of the two right-most terms in (71) is
strictly less than 1 by assumption. On the other hand, $\eta \mapsto \epsilon P(A) e^{\eta \mu}$ is exponentially
increasing since $\epsilon > 0$ and $\mu > 0$ (and hence $P(A) > 0$ as well) by assumption; thus, the first
term in (71) can be made arbitrarily large by increasing $\eta$. Consequently, we can choose
$\epsilon > 0$ as small as desired and then choose $\eta < \infty$ as large as desired such that the mean of
$Z'$ is arbitrarily close to $\mu$ and $\mathbb{E}[\exp(-\eta Z')] = 1$ respectively. ■
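The construction in Lemma A.1 can be illustrated on a small discrete example. The sketch below is not part of the proof: the two-atom distribution, the constant names and the bisection routine are all hypothetical choices. It moves a fraction $\epsilon$ of the mass on $A = [\mu, V]$ to the mirrored points in $A^-$ and then searches for the $\eta > 0$ at which the moment generating function of the modified variable returns to 1.

```python
import math

# Hypothetical distribution of Z: two non-negative atoms, so Z >= 0 a.s.
SUPPORT = (0.2, 0.8)
PROBS = (0.5, 0.5)
MU = sum(p * z for z, p in zip(SUPPORT, PROBS))  # mean of Z


def mgf_modified(eta, eps):
    """E[exp(-eta Z')] when a fraction eps of the mass on A = [MU, V] is mirrored to -z."""
    total = 0.0
    for z, p in zip(SUPPORT, PROBS):
        if z >= MU:
            total += (1.0 - eps) * p * math.exp(-eta * z)  # mass left on A
            total += eps * p * math.exp(eta * z)           # mass moved to -z in A^-
        else:
            total += p * math.exp(-eta * z)
    return total


def mean_modified(eps):
    """E[Z'] under the modified measure."""
    return sum((1.0 - 2.0 * eps) * p * z if z >= MU else p * z
               for z, p in zip(SUPPORT, PROBS))


def critical_eta(eps, lo=0.1):
    """Bisection for the positive eta with E[exp(-eta Z')] = 1 (assumes the mgf at lo is below 1)."""
    hi = 1.0
    while mgf_modified(hi, eps) <= 1.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mgf_modified(mid, eps) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


eta_small = critical_eta(1e-3)
eta_tiny = critical_eta(1e-6)
```

In this toy example, shrinking $\epsilon$ pushes the critical $\eta$ up while the mean of $Z'$ moves closer to $\mu$, which matches the qualitative statement of the lemma.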


Proof (of Lemma 7.2) Let $W$ denote the convex hull of $g([-1,1])$. We need to see if
$(-\frac{a}{n}, 1) \in W$. Note that $W$ is the convex set formed by starting with the graph of $x \mapsto
e^{\eta^* x}$ on the domain $[-1,1]$, including the line segment connecting this curve's endpoints
$(-1, e^{-\eta^*})$ to $(1, e^{\eta^*})$, and including all of the points below this line segment but above the
aforementioned graph. That is, $W$ is precisely the set

$$W = \left\{ (x,y) \in \mathbb{R}^2 : e^{\eta^* x} \le y \le \frac{e^{\eta^*} + e^{-\eta^*}}{2} + \frac{e^{\eta^*} - e^{-\eta^*}}{2}\, x,\ x \in [-1,1] \right\}.$$

We therefore need to check that $-1 \le -\frac{a}{n} \le 1$ and that 1 is sandwiched between the lower
and upper bounds at $x = -\frac{a}{n}$. Clearly $-1 \le -\frac{a}{n} \le 1$ holds since the loss is in $[0,1]$ by
assumption. Using that $\cosh(\eta^*) = \frac{e^{\eta^*} + e^{-\eta^*}}{2}$ and $\sinh(\eta^*) = \frac{e^{\eta^*} - e^{-\eta^*}}{2}$, this means that
$k \in W$ if and only if

$$e^{-\eta^* a/n} \le 1 \le \cosh(\eta^*) - \sinh(\eta^*)\, \frac{a}{n}.$$

Also, since $a > 0$ the inequality $e^{-\eta^* a/n} \le 1$ holds with strict inequality. Thus, we end up
with a single requirement characterizing when $k \in W$, which is equivalent to condition (53).
Moreover, $k \in \operatorname{int} W$ is characterized by when (53) holds strictly. ■
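A minimal sketch of this membership test, assuming the hull is generated by the curve $x \mapsto (x, e^{\eta x})$ on $[-1,1]$ as described in the proof. The function name `in_hull` and its arguments are hypothetical, not notation from the paper.

```python
import math


def in_hull(mean_excess, eta, strict=False):
    """Check whether the point (-mean_excess, 1) lies in co{(x, exp(eta * x)) : x in [-1, 1]}.

    mean_excess plays the role of a/n; strict=True tests for the interior.
    """
    x = -mean_excess
    if not -1.0 <= x <= 1.0:
        return False
    lower = math.exp(eta * x)                         # graph of x -> exp(eta x)
    upper = math.cosh(eta) + math.sinh(eta) * x       # chord through the endpoints
    if strict:
        return lower < 1.0 < upper
    return lower <= 1.0 <= upper


# The upper-bound requirement rearranges to a/n <= (cosh(eta) - 1) / sinh(eta),
# which equals tanh(eta / 2) by a standard hyperbolic identity.
threshold = (math.cosh(1.0) - 1.0) / math.sinh(1.0)
```

For $\eta = 1$ the threshold is $\tanh(1/2) \approx 0.46$, so a mean excess loss of 0.1 passes the test while 0.6 does not.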


Proof (of Theorem 7.3) By assumption, the condition of Lemma 7.2 is satisfied, so we can
apply Theorem 3 of Kemperman (1968). This gives

$$\mathbb{E}\left[-\exp\left(\frac{\eta}{2} S\right)\right] \ge d_0 - \frac{a}{n}\, d_1 + d_2, \tag{72}$$
for all $d^* = (d_0, d_1, d_2) \in \mathbb{R}^3$ such that

$$d_0 + d_1 s + d_2 e^{\eta s} + e^{\eta s/2} \le 0 \quad \text{for all } s \in [-1,1]. \tag{73}$$

To find a good choice of $d^*$, we will restrict attention to those $d^*$ for which (73) holds with
equality at $s = 0$, yielding the constraint

$$d_0 = -d_2 - 1. \tag{74}$$

Plugging this into (73) and changing variables to $c_1 = -d_1/\eta$,$^{10}$ and $c_2 = -d_2$, we obtain
the constraint

$$u(s) := 1 + c_2 (e^{\eta s} - 1) - e^{\eta s/2} + \eta c_1 s \ge 0 \quad \text{for all } s \in [-1,1].$$


A.4.1 Constraints from the Local Minimum at 0 

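Since $u(0) = 0$, we need $s = 0$ to be a local minimum of $u$, and so we require the first and
second derivative to satisfy

(a) $u'(0) = 0$

(b) $u''(0) \ge 0$,

since otherwise there exists some small $\epsilon > 0$ such that either $u(\epsilon) < 0$ or $u(-\epsilon) < 0$.

For (a), we compute

$$u'(s) = \eta c_2 e^{\eta s} - \frac{\eta}{2} e^{\eta s/2} + \eta c_1.$$

Since we require $u'(0) = 0$, we pick up the constraint

$$\eta \left( c_2 - \frac{1}{2} + c_1 \right) = 0,$$

and since $\eta > 0$ by assumption, we have

$$c_1 = \frac{1}{2} - c_2. \tag{75}$$

Thus, we can eliminate $c_1$ from $u(s)$:

$$u(s) = 1 + c_2 (e^{\eta s} - 1) - e^{\eta s/2} + \eta \left( \frac{1}{2} - c_2 \right) s.$$

For (b), observe that

$$u''(s) = \eta^2 c_2 e^{\eta s} - \frac{\eta^2}{4} e^{\eta s/2},$$

so that $u''(0) = \eta^2 (c_2 - \frac{1}{4}) \ge 0$, and hence we require

$$c_2 \ge \frac{1}{4}. \tag{76}$$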
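The two local conditions at $s = 0$ can be sanity-checked with finite differences. This is only a sketch: it assumes $u(s) = 1 + c_2(e^{\eta s} - 1) - e^{\eta s/2} + \eta(\frac{1}{2} - c_2)s$, that is, $u$ after eliminating $c_1$ via (75), and the helper names, the step size and the test values of $\eta$ and $c_2$ are hypothetical.

```python
import math


def u(s, eta, c2):
    """u(s) with c1 eliminated via c1 = 1/2 - c2."""
    return (1.0 + c2 * math.expm1(eta * s)
            - math.exp(eta * s / 2.0)
            + eta * (0.5 - c2) * s)


def central_differences(eta, c2, h=1e-3):
    """Finite-difference estimates of u'(0) and u''(0)."""
    first = (u(h, eta, c2) - u(-h, eta, c2)) / (2.0 * h)
    second = (u(h, eta, c2) - 2.0 * u(0.0, eta, c2) + u(-h, eta, c2)) / h ** 2
    return first, second


# Hypothetical values with c2 >= 1/4, as required by (76).
eta, c2 = 2.0, 0.3
first, second = central_differences(eta, c2)
expected_second = eta ** 2 * (c2 - 0.25)  # closed form for u''(0)
```

With these values the estimated first derivative is numerically zero and the second derivative matches $\eta^2(c_2 - \frac{1}{4}) = 0.2$.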


10. We scale by $\eta$ here because we are chasing a certain $\eta$-dependent rate.

A.4.2 The Other Minima of u


Thus far, we have picked up the constraints (74), (75), and (76), and it remains to choose
a value of $c_2$ such that $u(s) \ge 0$ for all $s \in [-1,1]$. To this end, observe that $u'(s)$ has at
most two roots, because with the substitution $y = e^{\eta s/2}$ we have

$$u'(s) = \eta c_2 y^2 - \frac{\eta}{2}\, y + \eta \left( \frac{1}{2} - c_2 \right),$$

which is a quadratic equation in $y$ with two roots:

$$y \in \left\{ \frac{1 - 2c_2}{2c_2},\ 1 \right\} \quad \Longrightarrow \quad s \in \left\{ \frac{2}{\eta} \log \frac{1 - 2c_2}{2c_2},\ 0 \right\}.$$

Now, since we are taking $c_2 > \frac{1}{4}$, the first root is negative, and we find that $u$ is
non-decreasing on $[0,1]$. As we already ensured that $u(0) = 0$, this means that $u$ is non-negative
on $[0,1]$. On the remaining interval, $[-1,0]$, we know that $u$ is increasing up to $\frac{2}{\eta} \log \frac{1 - 2c_2}{2c_2}$
and then decreasing until $s = 0$. Since $u(0) = 0$, we therefore need to ensure only that
$u(-1) \ge 0$ by finding appropriate conditions on $c_2$, where

$$u(-1) = 1 + c_2 (e^{-\eta} - 1) - e^{-\eta/2} - \eta \left( \frac{1}{2} - c_2 \right)$$
$$= \left( 1 - \frac{\eta}{2} \right) - e^{-\eta/2} + c_2 \left( e^{-\eta} - (1 - \eta) \right),$$

which is non-negative if and only if

$$c_2 \ge \frac{e^{-\eta/2} + \frac{\eta}{2} - 1}{e^{-\eta} + \eta - 1} = \frac{1}{4}\, \frac{\kappa(-\eta/2)}{\kappa(-\eta)},$$

where $\kappa(x) = (e^x - x - 1)/x^2$ is increasing in $x$, which implies that this condition always
ensures that $c_2 > 1/4$.

We consider the cases $\eta \le 1$ and $\eta > 1$ separately.


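Case $\eta \le 1$. For $\eta \le 1$, we will take the value of the constraint at $\eta = 1$. That is,

$$c_2 = \frac{1}{4}\, \frac{\kappa(-1/2)}{\kappa(-1)} = e^{1/2} - \frac{e}{2}.$$

This is allowed because $\frac{e^{-\eta/2} + \frac{\eta}{2} - 1}{e^{-\eta} + \eta - 1}$ is non-decreasing in $\eta$, as may be verified by observing that

$$\frac{d}{d\eta}\, \frac{e^{-\eta/2} + \frac{\eta}{2} - 1}{e^{-\eta} + \eta - 1} = \frac{e^{\eta/2} \left( e^{\eta/2} - 1 \right) \left( -1 + e^{\eta} - e^{\eta/2} \eta \right)}{2 \left( 1 + e^{\eta} (\eta - 1) \right)^2},$$

which is non-negative if $g(\eta) = -1 + e^{\eta} - e^{\eta/2} \eta \ge 0$. This in turn is verified by noting that
$g(0) = 0$ and $g'(\eta) = e^{\eta/2} \left( e^{\eta/2} - \frac{\eta}{2} - 1 \right)$ is positive.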
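The monotonicity claim for this case can be checked on a grid. The sketch assumes the constraint has the form $c_2 \ge (e^{-\eta/2} + \eta/2 - 1)/(e^{-\eta} + \eta - 1)$, which is what $u(-1) \ge 0$ rearranges to; the function name and the grid are hypothetical, and a grid check is evidence rather than a proof.

```python
import math


def constraint(eta):
    """Lower bound on c2 implied by u(-1) >= 0, written with expm1 for accuracy at small eta."""
    numerator = math.expm1(-eta / 2.0) + eta / 2.0   # e^{-eta/2} + eta/2 - 1
    denominator = math.expm1(-eta) + eta             # e^{-eta} + eta - 1
    return numerator / denominator


# Grid over (0, 1]; eta = 0 itself is excluded because the ratio is 0/0 there.
grid = [0.05 + 0.95 * i / 200 for i in range(201)]
values = [constraint(eta) for eta in grid]
is_monotone = all(later >= earlier for earlier, later in zip(values, values[1:]))

# Value of the constraint at eta = 1, used as c2 throughout this case.
c2_small_eta = math.sqrt(math.e) - math.e / 2.0
```

On this grid the ratio increases from just above 1/4 to about 0.2896 at $\eta = 1$, so the single choice $c_2 = e^{1/2} - e/2$ satisfies the constraint for every $\eta \le 1$ on the grid.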

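Case $\eta > 1$. Let $c_2 = \frac{1}{2} - \frac{\alpha}{\eta}$ for some $\alpha > 0$. With this substitution, we have

$$u(-1) = 1 + c_2 (e^{-\eta} - 1) - e^{-\eta/2} - \eta \left( \frac{1}{2} - c_2 \right)$$
$$= \frac{1 + e^{-\eta}}{2} - e^{-\eta/2} + \alpha \left( \frac{1 - e^{-\eta}}{\eta} - 1 \right).$$

Since we want the above to be nonnegative for all $\eta \ge 1$, we arrive at the condition

$$\alpha \le \inf_{\eta \ge 1} \frac{\frac{1 + e^{-\eta}}{2} - e^{-\eta/2}}{1 - \frac{1 - e^{-\eta}}{\eta}}. \tag{77}$$

Plotting suggests that the minimum is attained at $\eta = 1$, with the value $\frac{1}{2}(\sqrt{e} - 1)^2 =
0.2104\ldots$. We will fix $\alpha$ to this value and verify that

$$\frac{1 + e^{-\eta}}{2} - e^{-\eta/2} + \alpha \left( -1 + \frac{1}{\eta} (1 - e^{-\eta}) \right) \ge 0. \tag{78}$$

This is true with equality at $\eta = 1$. The derivative of the LHS with respect to $\eta$ is

$$\frac{e^{-\eta}}{2} \left( e^{\eta/2} - 1 - \frac{(\sqrt{e} - 1)^2 (e^{\eta} - \eta - 1)}{\eta^2} \right).$$

The derivative is positive at $\eta = 1$, so the value 0 attained there is a candidate minimum. Eventually,
$(\sqrt{e} - 1)^2 (e^{\eta} - \eta - 1)/\eta^2$ grows more quickly than $e^{\eta/2} - 1$ and surpasses the latter in value.
The derivative is therefore negative for all sufficiently large $\eta$, and so we need only take the
minimum of the LHS of (78) evaluated at $\eta = 1$ and the limiting value as $\eta \to \infty$. We have

$$\lim_{\eta \to \infty} \left[ \frac{1 + e^{-\eta}}{2} - e^{-\eta/2} + \alpha \left( -1 + \frac{1}{\eta} (1 - e^{-\eta}) \right) \right] = \frac{1}{2} - \alpha = \sqrt{e} - \frac{e}{2} > 0.$$

Hence, (78) indeed holds for $\alpha \le 0.21 < \frac{1}{2}(\sqrt{e} - 1)^2$. We conclude that $u(-1) \ge 0$ when
$\alpha \le \frac{1}{2}(\sqrt{e} - 1)^2$.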
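The text relies on plotting to locate the infimum in (77). A small grid search gives the same picture. It is a numerical sketch, not a proof, and it assumes the ratio in (77) is $\big(\frac{1 + e^{-\eta}}{2} - e^{-\eta/2}\big) / \big(1 - \frac{1 - e^{-\eta}}{\eta}\big)$; the function names, grid range and step are hypothetical.

```python
import math


def alpha_bound(eta):
    """Ratio inside the infimum of (77), under the assumed form stated above."""
    numerator = (1.0 + math.exp(-eta)) / 2.0 - math.exp(-eta / 2.0)
    denominator = 1.0 - (1.0 - math.exp(-eta)) / eta
    return numerator / denominator


def lhs_78(eta, alpha):
    """Left-hand side of (78) for a given alpha."""
    return ((1.0 + math.exp(-eta)) / 2.0 - math.exp(-eta / 2.0)
            + alpha * (-1.0 + (1.0 - math.exp(-eta)) / eta))


alpha = 0.5 * (math.sqrt(math.e) - 1.0) ** 2          # approximately 0.2104
grid = [1.0 + 0.01 * i for i in range(4901)]          # eta from 1 to 50
smallest_ratio = min(alpha_bound(eta) for eta in grid)
smallest_lhs = min(lhs_78(eta, alpha) for eta in grid)
```

On this grid the smallest ratio occurs at $\eta = 1$ and equals $\frac{1}{2}(\sqrt{e} - 1)^2$, and the left-hand side of (78) stays non-negative up to rounding.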


A.4.3 Putting it All Together 

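Tracing back our substitutions, we have $d_0 + d_2 = -1$ and $d_1 = -\eta/2 + \eta c_2$, which gives

$$d_0 - \frac{a}{n}\, d_1 + d_2 = -1 + \frac{\eta a}{n} \left( \frac{1}{2} - c_2 \right) \ge -e^{-\frac{\eta a}{n} \left( \frac{1}{2} - c_2 \right)}.$$

In the regime $\eta \le 1$, we choose $c_2 = \sqrt{e} - e/2$, which leads to

$$d_0 - \frac{a}{n}\, d_1 + d_2 \ge -e^{-0.21\, \eta a / n}. \tag{79}$$

In the regime $\eta > 1$, we take $c_2 = \frac{1}{2} - \frac{1}{2\eta} (\sqrt{e} - 1)^2$, which gives

$$d_0 - \frac{a}{n}\, d_1 + d_2 \ge -e^{-0.21\, a / n}. \tag{80}$$

Combining with (72) leads to the desired result. ■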
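As an end-to-end sanity check of the choices made in this proof, the sketch below evaluates $u$ on a grid of $s \in [-1,1]$ for several values of $\eta$, using $c_2 = \sqrt{e} - e/2$ when $\eta \le 1$ and $c_2 = \frac{1}{2} - \frac{(\sqrt{e} - 1)^2}{2\eta}$ when $\eta > 1$. The function names, the list of $\eta$ values and the grid size are hypothetical, and the form of $u$ is the one obtained after eliminating $c_1$.

```python
import math

ALPHA = 0.5 * (math.sqrt(math.e) - 1.0) ** 2


def choose_c2(eta):
    """Regime-dependent choice of c2 described in the proof."""
    if eta <= 1.0:
        return math.sqrt(math.e) - math.e / 2.0
    return 0.5 - ALPHA / eta


def u(s, eta, c2):
    """u(s) = 1 + c2 (e^{eta s} - 1) - e^{eta s / 2} + eta (1/2 - c2) s."""
    return (1.0 + c2 * math.expm1(eta * s)
            - math.exp(eta * s / 2.0)
            + eta * (0.5 - c2) * s)


def min_u_on_grid(eta, points=2001):
    """Smallest value of u over an evenly spaced grid on [-1, 1]."""
    c2 = choose_c2(eta)
    return min(u(-1.0 + 2.0 * i / (points - 1), eta, c2) for i in range(points))


etas = [0.1, 0.5, 1.0, 2.0, 5.0, 20.0]
worst = min(min_u_on_grid(eta) for eta in etas)

# Rate constant (1/2 - c2) * eta appearing in the exponent of the final bound.
rates = [(0.5 - choose_c2(eta)) * eta for eta in etas]
```

For these values $u$ stays non-negative on the grid up to rounding, and the rate $\eta(\frac{1}{2} - c_2)$ is at least $0.21\,(\eta \wedge 1)$, in line with (79) and (80).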


Proof (of Corollary 7.4) Define the function $T(\eta) := \frac{\cosh(\eta) - 1}{\sinh(\eta)}$. For a negative excess loss
random variable $S'$, let $\eta_{S'}$ be the maximum $\eta$ for which $-S'$ is stochastically mixable.
Let $W$ be a stochastically mixable excess loss random variable taking values in $[-1,1]$
and satisfying $\mathbb{E}[W] = T(\eta_S) > 0$, and let $S = -W$ be the corresponding negative excess
loss random variable.

Let $k_S \in \mathbb{R}^2$ be the moments vector of $S$, defined as

$$k_S := \begin{pmatrix} \mathbb{E}[S] \\ \mathbb{E}[e^{\eta_S S}] \end{pmatrix}.$$


Because $-\mathbb{E}[S] = T(\eta_S)$, from Lemma 7.2 the point $k_S$ is extremal with respect to
$\operatorname{co}(g([-1,1]))$. Recall that the goal of this proof is to establish that Theorem 7.3 holds even
for the extremal random variable $S$.

Since $\mathbb{E}[S] < 0$, there exists $A \subset \{x \in \mathbb{R} : x < 0\}$ for which we have $\Pr(S \in A) =: p > 0$.
Now, consider the following two perturbed versions of $S$, which we call (I) and (II). In both
perturbations, we deflate $\Pr(S \in A)$ by the same (multiplicative) factor $\epsilon > 0$ uniformly
over $A$ so that the overall loss in probability mass over $A$ is $\epsilon$; this is always possible for
small enough $\epsilon$ since $p > 0$, and throughout the rest of the proof we keep implicit that $\epsilon$ is
suitably small. The perturbations differ in where they allocate the mass taken from $A$:

(I) Allocate $\epsilon$ additional mass to $\frac{1}{2}$.

(II) Allocate $\frac{\epsilon}{2}$ additional mass to $0$ and $\frac{\epsilon}{2}$ additional mass to $1$.

We refer to these new random variables as $S_I$ and $S_{II}$. Observe that

$$\mathbb{E}[S_I] = \mathbb{E}[S_{II}] \ge \mathbb{E}[S] + \tfrac{1}{2}\epsilon.$$

Because $\mathbb{E}[S_I] = \mathbb{E}[S_{II}]$, it follows that if we can show that $\eta_{S_I} \ne \eta_{S_{II}}$, then $k_{S_I}$ and $k_{S_{II}}$
cannot both be extremal since $T$ is strictly increasing.

Now, by definition, $\mathbb{E} \exp(\eta_{S_I} S_I) = 1$. But observe that by strict convexity, for any
$\eta > 0$, we have

$$e^{\eta/2} < \tfrac{1}{2} e^{0} + \tfrac{1}{2} e^{\eta}.$$

Therefore, $\mathbb{E}[\exp(\eta_{S_I} S_{II})] > 1$, and so $\eta_{S_{II}} < \eta_{S_I}$. Therefore, $k_{S_I}$ cannot be extremal, and
Theorem 7.3 can be applied to the excess loss random variable $-S_I$.

Now, for each (suitably small) $\epsilon$, we refer to the corresponding $S_I$ more precisely via
the notation $S_\epsilon$, and we define $\eta_\epsilon := \eta_{S_\epsilon}$. Since for all $\epsilon > 0$,


exp 



< exp I — 


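and since for each $S_\epsilon$ we have

$$\mathbb{E}\left[ \exp\left( \frac{\eta_\epsilon}{2} S_\epsilon \right) \right] \le 1 - 0.21\, (\eta_\epsilon \wedge 1)\, \mathbb{E}[-S_\epsilon],$$

from the dominated convergence theorem it follows that

$$\mathbb{E}\left[ \exp\left( \frac{\eta_S}{2} S \right) \right] \le 1 - 0.21\, (\eta_S \wedge 1)\, \mathbb{E}[-S],$$

i.e. using the familiar notation $\eta^* = \eta_S$:

$$\mathbb{E}\left[ \exp\left( -\frac{\eta^*}{2} W \right) \right] \le 1 - 0.21\, (\eta^* \wedge 1)\, \mathbb{E}[W].$$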


Proof (of Corollary 7.5) Let X be a random variable taking values in [—V, V] with mean 

— ^ and E[e^'’^] = 1, and let E be a random variable taking values in [—1,1] with mean 

— and = 1. Consider a random variable X that is a ^-scaled independent 

copy of X] observe that E[X] = — and = 1. Let the maximal possible value 

of be bx, and let the maximal possible value of be by- We claim that 

bx = by. Let X be a random variable with a distribution that maximizes sub¬ 
ject to the previously stated constraints on X. Since X satisfies = hx^ setting 

Y = X shows that in fact by > bx- A symmetric argument (starting with Y and passing 
to some Y = VY) implies that bx > by. ■ 


Proof (of Theorem 7 . 6 ) Let 7n = 77 for a constant a to be fixed later. For each 77 > 0, 
iv) 

let C correspond to those functions in for which 77 is the largest constant 
such that E[exp(—r7VF7-)] = 1 . Let C Xy^ correspond to functions / in Xy^ for 

which lim^^oo E[exp(-r7lF7)] < 1. Clearly, Xy^ = (Ur,e[r;*,oo) U The excess 

loss random variables corresponding to elements / G Xy^^‘^'^ are ‘hyper-concentrated’ in the 
sense that they are infinitely stochastically mixable. However, Lemma A.l above shows 
that for each hyper-concentrated Wf, there exists another excess loss random variable Wj 
with mean arbitrarily close to that of Wf, with E[exp(—77Wj)] = 1 for some arbitrarily 
large but finite 77, and with Wf <Wf with probability 1 . The last property implies that the 
empirical risk of Wf is no greater than that of VFy; hence for each hyper-concentrated Wf it 
is sufficient (from the perspective of ERM) to study a corresponding Wf. From now on, we 

implicitly make this replacement in Xy^ itself, so that we now have Xy^ = Ur;e[r?* 00) ■ 

Consider an arbitrary a > 0. For some fixed 77 G [77*,00) for which > 0, consider 

(v) 

the subclass Xy„ . Individually for each such function, we will apply Lemma 7.1 as follows. 
From Lemma 7 . 5 , we have A.-Wj{r]/ 2 ) = A_j_^^(1^77/2). From Corollary 7 . 4 , the latter is 

at most = _ 0:21?^ ^ Hence, Lemma 7.1 with t = 0 and the 77 from the 

lemma taken to be 77/2 implies that the probability of the event Pni{-,f) < Pn^{-,f*) is 
at most exp Applying the union bound over all of JFy„, we conclude that 

Pr{3/ € X,. < iVexp (-p* . 

Since ERM selects hypotheses on their empirical risk, from inversion it holds that with 
probability at least 1 — 5 ERM will not select any hypothesis with excess risk at least 

5max|v",^}(log J-l-logAf) 